Abstract

This paper tackles the problem of detecting abrupt changes in the mean of a het- 
eroscedastic signal by model selection, without knowledge on the variations of the 
noise. A new family of change-point detection procedures is proposed, showing that 
cross-validation methods can be successful in the heteroscedastic framework, whereas 
most existing procedures are not robust to heteroscedasticity. The robustness to het- 
eroscedasticity of the proposed procedures is supported by an extensive simulation 
study, together with recent theoretical results. An application to Comparative Ge- 
nomic Hybridization (CGH) data is provided, showing that robustness to heteroscedas- 
ticity can indeed be required for their analysis. 



1 Introduction 

The problem tackled in the paper is the detection of abrupt changes in the mean of a signal 
without assuming its variance is constant. Model selection and cross-validation techniques 
are used for building change-point detection procedures that significantly improve on ex- 
isting procedures when the variance of the signal is not constant. Before detailing the 
approach and the main contributions of the paper, let us motivate the problem and briefly 
recall some related works in the change-point detection literature. 



1.1 Change-point detection 

The change-point detection problem, also called one-dimensional segmentation, deals with 
a stochastic process the distribution of which abruptly changes at some unknown instants. 
The purpose is to recover the location of these changes and their number. This problem 
is motivated by a wide range of applications, such as voice recognition, financial time-
series analysis [29] and Comparative Genomic Hybridization (CGH) data analysis [35]. A
large literature exists about change-point detection in many frameworks [see [12, 17], for a
complete bibliography].

The first papers on change-point detection were devoted to the search for the location of
a unique change-point, also named breakpoint [see [3J, for instance]. Looking for multiple
change-points is a harder task and has been studied later. For instance, Yao [49] used
the BIC criterion for detecting multiple change-points in a Gaussian signal, and Miao and
Zhao [33] proposed an approach relying on rank statistics.



The setting of the paper is the following. The values $Y_1, \ldots, Y_n \in \mathbb{R}$ of a noisy signal
at points $t_1, \ldots, t_n$ are observed, with

$$Y_i = s(t_i) + \sigma(t_i)\epsilon_i , \qquad E[\epsilon_i] = 0 \quad \text{and} \quad \mathrm{Var}(\epsilon_i) = 1 . \tag{1}$$

The function $s$ is called the regression function and is assumed to be piecewise-constant, or
at least well approximated by piecewise constant functions, that is, $s$ is smooth everywhere
except at a few breakpoints. The noise terms $\epsilon_1, \ldots, \epsilon_n$ are assumed to be independent
and identically distributed. No assumption is made on $\sigma : [0,1] \mapsto [0, \infty)$. Note that all
data $(t_i, Y_i)_{1 \le i \le n}$ are observed before detecting the change-points, a setting which is called
off-line.

As pointed out by Lavielle [28], multiple change-point detection procedures generally
tackle one among the following three problems:

1. Detecting changes in the mean $s$ assuming the standard-deviation $\sigma$ is constant,

2. Detecting changes in the standard-deviation $\sigma$ assuming the mean $s$ is constant,

3. Detecting changes in the whole distribution of $Y$, with no distinction between changes
in the mean $s$, changes in the standard-deviation $\sigma$ and changes in the distribution
of $\epsilon$.

In applications such as CGH data analysis, changes in the mean $s$ have an important
biological meaning, since they correspond to the limits of amplified or deleted areas of
chromosomes. However, in the CGH setting, the standard-deviation $\sigma$ is not always
constant, as assumed in problem 1. See Section 6 for more details on CGH data, for which
heteroscedasticity (that is, variations of $\sigma$) corresponds to experimental artefacts or
biological nuisance that should be removed.

Therefore, CGH data analysis requires solving a fourth problem, which is the purpose
of the present article:

4. Detecting changes in the mean $s$ with no constraint on the standard-deviation
$\sigma : [0,1] \mapsto [0,\infty)$.

Compared to problem 1, the difference is the presence of an additional nuisance parameter
$\sigma$, which makes problem 4 harder. To the best of our knowledge, no change-point detection
procedure has ever been proposed for solving problem 4 with no prior information on $\sigma$.



1.2 Model selection 

Model selection is a successful approach for multiple change-point detection, as shown by
Lavielle [28] and by Lebarbier [30] for instance. Indeed, a set of change-points (called a
segmentation) is naturally associated with the set of piecewise-constant functions that may
only jump at these change-points. Given a set of functions (called a model), estimation can
be performed by minimizing the least-squares criterion (or other criteria, see Section 3).
Therefore, detecting changes in the mean of a signal, that is, choosing a segmentation,
amounts to selecting such a model.

More precisely, given a collection of models $\{S_m\}_{m \in \mathcal{M}_n}$ and the associated collection of
least-squares estimators $\{\hat{s}_m\}_{m \in \mathcal{M}_n}$, the purpose of model selection is to provide a model
index $\hat{m}$ such that $\hat{s}_{\hat{m}}$ reaches the "best performance" among all estimators $\{\hat{s}_m\}_{m \in \mathcal{M}_n}$.

Model selection can target two different goals. On the one hand, a model selection
procedure is efficient when its quadratic risk is smaller than the smallest quadratic risk of
the estimators $\{\hat{s}_m\}_{m \in \mathcal{M}_n}$, up to a constant factor $C_n > 1$. Such a property is called an
oracle inequality when it holds for every finite sample size. The procedure is said to be
asymptotically efficient when the previous property holds with $C_n \to 1$ as $n$ tends to infinity.
Asymptotic efficiency is the goal of AIC and Mallows' $C_p$ [32], among many others.



On the other hand, assuming that $s$ belongs to one of the models $\{S_m\}_{m \in \mathcal{M}_n}$, a procedure
is model consistent when it chooses the smallest model containing $s$ asymptotically
with probability one. Model consistency is the goal of BIC [39] for instance. See also the
article by Yang [46] about the distinction between efficiency and model consistency.

In the present paper, as in [30], the quality of a multiple change-point detection procedure
is assessed by the quadratic risk; hence, a change in the mean hidden by the noise
should not be detected. This choice is motivated by applications where the signal-to-noise
ratio may be small, so that exactly recovering every true change-point is hopeless. Therefore,
efficient model selection procedures will be used in order to detect the change-points.

Without prior information on the locations of the change-points, the natural collection
of models for change-point detection depends on the sample size $n$. Indeed, there exist
$\binom{n-1}{D-1}$ different partitions of the $n$ design points into $D$ intervals, each partition
corresponding to a set of $(D-1)$ change-points. Since $D$ can take any value between 1 and $n$,
$2^{n-1}$ models can be considered. Therefore, model selection procedures used for multiple
change-point detection have to satisfy non-asymptotic oracle inequalities: the collection of models
cannot be assumed to be fixed with the sample size $n$ tending to infinity. (See Section 2.3
for a precise definition of the collection $\{S_m\}_{m \in \mathcal{M}_n}$ used for change-point detection.)
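
As a quick sanity check on the size of this collection, the following minimal Python sketch counts the segmentations of $n$ ordered design points into $D$ non-empty intervals and sums over $D$. The function name `count_segmentations` is ours, chosen for illustration only.

```python
from math import comb


def count_segmentations(n: int, d: int) -> int:
    """Number of ways to cut n ordered points into d non-empty intervals.

    Choosing d - 1 change-points among the n - 1 gaps between consecutive
    points gives the binomial coefficient C(n - 1, d - 1).
    """
    return comb(n - 1, d - 1)


n = 10
total = sum(count_segmentations(n, d) for d in range(1, n + 1))
print(total, 2 ** (n - 1))  # both counts agree
```

Even for moderate $n$ the total is far too large for exhaustive search, which is why the collection cannot be treated as fixed.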

Most model selection results consider "polynomial" collections of models $\{S_m\}_{m \in \mathcal{M}_n}$,
that is, $\mathrm{Card}(\mathcal{M}_n) \le C n^{\alpha}$ for some constants $C, \alpha > 0$. For polynomial collections, procedures
like AIC or Mallows' $C_p$ are proved to satisfy oracle inequalities in various frameworks
@, EE, 13, 3], assuming that data are homoscedastic, that is, $\sigma(t_i)$ does not depend on $t_i$.



However, as shown in [a], Mallows' $C_p$ is suboptimal when data are heteroscedastic, that
is, when the variance is not constant. Therefore, other procedures must be used. For instance,
resampling penalization is optimal with heteroscedastic data [H]. Another approach has
been explored by Gendre [2a], which consists in simultaneously estimating the mean and
the variance, using a particular polynomial collection of models.

However, in change-point detection, the collection of models is "exponential", that is,
$\mathrm{Card}(\mathcal{M}_n)$ is of order $\exp(\alpha n)$ for some $\alpha > 0$. For such large collections, especially larger
than polynomial, the above penalization procedures fail. Indeed, Birgé and Massart [la]
proved that the minimal amount of penalization required for a procedure to satisfy an
oracle inequality is of the form

$$\mathrm{pen}(m) = c_1 \frac{\sigma^2 D_m}{n} + c_2 \frac{\sigma^2 D_m}{n} \log\left(\frac{n}{D_m}\right) , \tag{2}$$

where $c_1$ and $c_2$ are positive constants and $\sigma^2$ is the variance of the noise, assumed to
be constant. Lebarbier [30] proposed $c_1 = 5$ and $c_2 = 2$ for optimizing the penalty (2)
in the context of change-point detection. Penalties similar to (2) have been introduced
independently by other authors [3I, QIC 45] and are shown to provide satisfactory results.
Nevertheless, all these results assume that data are homoscedastic. Actually, the model
selection problem with heteroscedastic data and an exponential collection of models has
never been considered in the literature, to the best of our knowledge.
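
To make the shape of such a penalty concrete, here is a minimal sketch that evaluates $c_1 \sigma^2 D_m / n + c_2 (\sigma^2 D_m / n) \log(n / D_m)$ with the constants $c_1 = 5$ and $c_2 = 2$ quoted above as defaults. The function name and signature are illustrative assumptions, and the noise variance `sigma2` is taken as known and constant, which is exactly the homoscedastic assumption discussed in the text.

```python
from math import log


def birge_massart_penalty(d_m: int, n: int, sigma2: float,
                          c1: float = 5.0, c2: float = 2.0) -> float:
    """Penalty c1*sigma2*D/n + c2*sigma2*(D/n)*log(n/D) for a model of dimension d_m.

    sigma2 is the (assumed constant) noise variance; c1 and c2 default to
    the values attributed to Lebarbier in the text.
    """
    ratio = sigma2 * d_m / n
    return c1 * ratio + c2 * ratio * log(n / d_m)


# For small dimensions the penalty grows almost linearly in d_m.
for d in (1, 2, 4, 8):
    print(d, round(birge_massart_penalty(d, n=100, sigma2=0.25 ** 2), 5))
```

The single scalar `sigma2` is the weak point when the noise level varies along the signal: no single value is correct for every segment.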

Furthermore, penalties of the form (2) are very close to being proportional to $D_m$, at least
for small values of $D_m$. Therefore, the results of [y] lead us to conjecture that the penalty (2)
is suboptimal for model selection over an exponential collection of models, when data are
heteroscedastic. The suggestion of this paper is to use cross-validation methods instead.



1.3 Cross-validation 



Cross-validation (CV) methods make it possible to estimate (almost) unbiasedly the quadratic risk of
any estimator, such as $\hat{s}_m$ (see Section 3.2.1 about the heuristics underlying CV). Classical
examples of CV methods are the leave-one-out [Loo, 27, 43] and V-fold cross-validation
[VFCV, 23, 24]. More references on cross-validation can be found in 0, Oil for instance.

CV can be used for model selection, by choosing the model $S_m$ for which the CV
estimate of the risk of $\hat{s}_m$ is minimal. The properties of CV for model selection with
a polynomial collection of models and homoscedastic data have been widely studied. In
short, CV is known to adapt to a wide range of statistical settings, from density estimation
[13,01] to regression [44, 48] and classification [26, 47]. In particular, Loo is asymptotically
equivalent to AIC or Mallows' $C_p$ in several frameworks where they are asymptotically
optimal, and other CV methods have similar performances, provided the size of the training
sample is close enough to the sample size [see for instance [31, 40, 22]]. In addition, CV
methods are robust to heteroscedasticity of data 0,3, as well as several other resampling
methods. Therefore, CV is a natural alternative to penalization procedures assuming
homoscedasticity.

Nevertheless, nearly nothing is known about CV for model selection with an exponential
collection of models, such as in the change-point detection setting. The literature on model
selection and CV [14], 0,1 la, [21] only suggests that minimizing directly the Loo estimate
of the risk over $2^{n-1}$ models would lead to overfitting.

In this paper, a remark made by Birgé and Massart [lli] about penalization procedures
is used for solving this issue in the context of change-point detection. Model selection is
performed in two steps: first, choose a segmentation given the number of change-points;
second, choose the number of change-points. CV methods can be used at each step, leading
to Procedure 6 (Section 5). The paper shows that such an approach is indeed successful
for detecting changes in the mean of a heteroscedastic signal.



1.4 Contributions of the paper 

The main purpose of the present work is to design a CV-based model selection procedure
(Procedure 6) that can be used for detecting multiple changes in the mean of a
heteroscedastic signal. Such a procedure experimentally adapts to heteroscedasticity when
the collection of models is exponential, which has never been obtained before. In particular,
Procedure 6 is a reliable alternative to Birgé and Massart's penalization procedure
[15] when data can be heteroscedastic.

Another major difficulty tackled in this paper is the computational cost of resampling
methods when selecting among $2^n$ models. Even when the number $(D-1)$ of change-points
is given, exploring the $\binom{n-1}{D-1}$ partitions of $[0,1]$ into $D$ intervals and performing a
resampling algorithm for each partition is not feasible when $n$ is large and $D > 0$. An
implementation of Procedure 6 with a tractable computational complexity is proposed in
the paper, using closed-form formulas for Leave-p-out (Lpo) estimators of the risk, dynamic
programming, and V-fold cross-validation.

The paper also points out that least-squares estimators are not reliable for change-point
detection when the number of breakpoints is given, although they are widely used
for this purpose in the literature. Indeed, experimental and theoretical results detailed in
Section 3 show that least-squares estimators suffer from local overfitting when the variance
of the signal varies over the sequence of observations. On the contrary, minimizers of
the Lpo estimator of the risk do not suffer from this drawback, which emphasizes the
interest of using cross-validation methods in the context of change-point detection.

The paper is organized as follows. The statistical framework is described in Section 2.
First, the problem of selecting the "best" segmentation given the number of change-points
is tackled in Section 3. Theoretical results and an extensive simulation study show that
the usual minimization of the least-squares criterion can be misleading when data are
heteroscedastic, whereas cross-validation-based procedures provide satisfactory results in
the same framework.

Then, the problem of choosing the number of breakpoints from data is addressed in
Section 4. As supported by an extensive simulation study, V-fold cross-validation (VFCV)
leads to a computationally feasible and statistically efficient model selection procedure
when data are heteroscedastic, contrary to procedures implicitly assuming homoscedasticity.

The resampling methods of Sections 3 and 4 are combined in Section 5, leading to a
family of resampling-based procedures for detecting changes in the mean of a heteroscedastic
signal. A wide simulation study shows they perform well with both homoscedastic and
heteroscedastic data, significantly improving the performance of procedures which implicitly
assume homoscedasticity.

Finally, Section [6] illustrates on a real data set the promising behaviour of the proposed 
procedures for analyzing CGH microarray data, compared to procedures previously used 
in this setting. 

2 Statistical framework 

In this section, the statistical framework of change-point detection via model selection is 
introduced, as well as some notation. 

2.1 Regression on a fixed design 

Let $S^*$ denote the set of measurable functions $[0,1] \mapsto \mathbb{R}$. Let $t_1 < \cdots < t_n \in [0,1]$ be
some deterministic design points, $s \in S^*$ and $\sigma : [0,1] \mapsto [0,\infty)$ be some functions and
define

$$\forall i \in \{1, \ldots, n\}, \quad Y_i = s(t_i) + \sigma(t_i)\epsilon_i , \tag{3}$$

where $\epsilon_1, \ldots, \epsilon_n$ are independent and identically distributed random variables with $E[\epsilon_i] = 0$
and $E[\epsilon_i^2] = 1$.

As explained in Section 1.1, the goal is to find from $(t_i, Y_i)_{1 \le i \le n}$ a piecewise-constant
function $f \in S^*$ close to $s$ in terms of the quadratic loss

$$\|s - f\|_n^2 := \frac{1}{n}\sum_{i=1}^n \left(f(t_i) - s(t_i)\right)^2 .$$

2.2 Least-squares estimator 

A classical estimator of $s$ is the least-squares estimator, defined as follows. For every
$f \in S^*$, the least-squares criterion at $f$ is defined by

$$P_n\gamma(f) := \frac{1}{n}\sum_{i=1}^n \left(Y_i - f(t_i)\right)^2 .$$

The notation $P_n\gamma(f)$ means that the function $(t, Y) \mapsto \gamma(f; (t, Y)) := (Y - f(t))^2$ is
integrated with respect to the empirical distribution $P_n := n^{-1}\sum_{i=1}^n \delta_{(t_i, Y_i)}$. $P_n\gamma(f)$ is
also called the empirical risk of $f$.

Then, given a set $S \subset S^*$ of functions $[0,1] \mapsto \mathbb{R}$ (called a model), the least-squares
estimator on model $S$ is

$$\mathrm{ERM}(S; P_n) := \arg\min_{f \in S} \{P_n\gamma(f)\} .$$

The notation $\mathrm{ERM}(S; P_n)$ stresses that the least-squares estimator is the output of the
empirical risk minimization algorithm over $S$, which takes a model $S$ and a data sample
as inputs. When a collection of models $\{S_m\}_{m \in \mathcal{M}_n}$ is given, $\hat{s}_m(P_n)$ or $\hat{s}_m$ are shortcuts
for $\mathrm{ERM}(S_m; P_n)$.
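
When the model is the set of piecewise constant functions on a given partition, the least-squares estimator simply takes the mean of the observations on each interval. The sketch below illustrates this under our own conventions, which are not from the paper: a segmentation is encoded as a list `bounds` of indices, so that segment `j` covers `y[bounds[j]:bounds[j + 1]]`.

```python
def fit_regressogram(y, bounds):
    """Least-squares fit on the piecewise-constant model given by `bounds`.

    On each segment the minimizer of the squared error is the segment mean.
    """
    fitted = []
    for a, b in zip(bounds[:-1], bounds[1:]):
        mean = sum(y[a:b]) / (b - a)
        fitted.extend([mean] * (b - a))
    return fitted


def empirical_risk(y, fitted):
    """Least-squares criterion: average squared residual over the sample."""
    return sum((yi - fi) ** 2 for yi, fi in zip(y, fitted)) / len(y)


y = [1.0, 1.0, 3.0, 3.0]
print(empirical_risk(y, fit_regressogram(y, [0, 2, 4])))  # two segments
print(empirical_risk(y, fit_regressogram(y, [0, 4])))     # one segment
```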

2.3 Collection of models 

Since the goal is to detect jumps of s, every model considered in this article is the set of 
piecewise constant functions with respect to some partition of [0, 1] . 

For every $K \in \{1, \ldots, n-1\}$ and every sequence of integers $\alpha_0 = 1 < \alpha_1 < \alpha_2 < \cdots <
\alpha_K \le n$ (the breakpoints), $(I_\lambda)_{\lambda \in \Lambda_{(\alpha_1, \ldots, \alpha_K)}}$ denotes the partition

$$[0, t_{\alpha_1}), \ [t_{\alpha_1}, t_{\alpha_2}), \ \ldots, \ [t_{\alpha_K}, 1]$$

of $[0,1]$ into $(K+1)$ intervals. Then, the model $S_{(\alpha_1, \ldots, \alpha_K)}$ is defined as the set of piecewise
constant functions that can only jump at $t = t_{\alpha_j}$ for some $j \in \{1, \ldots, K\}$.

For every $K \in \{1, \ldots, n-1\}$, let $\widetilde{\mathcal{M}}_n(K+1)$ denote the set of such sequences
$(\alpha_1, \ldots, \alpha_K)$ of length $K$, so that $\{S_m\}_{m \in \widetilde{\mathcal{M}}_n(K+1)}$ is the collection of models of
piecewise constant functions with $K$ breakpoints. When $K = 0$, $\widetilde{\mathcal{M}}_n(1) := \{\emptyset\}$ and the
model $S_{\emptyset}$ is the linear space of constant functions on $[0,1]$. Remark that for every $K$
and $m \in \widetilde{\mathcal{M}}_n(K+1)$, $S_m$ is a vector space of dimension $D_m = K + 1$. In the rest of the paper,
the relationship between the number of breakpoints $K$ and the dimension $D = K + 1$
of the model $S_{(\alpha_1, \ldots, \alpha_K)}$ is used repeatedly; in particular, estimating the number of breakpoints
(Section 4) is equivalent to choosing the dimension of a model. In addition, since a
model $S_m$ is uniquely defined by $m$, the index $m$ is also called a model.

The classical collection of models for change-point detection can now be defined as
$\{S_m\}_{m \in \widetilde{\mathcal{M}}_n}$, where $\widetilde{\mathcal{M}}_n = \bigcup_{D \in \mathcal{D}_n} \widetilde{\mathcal{M}}_n(D)$ and $\mathcal{D}_n = \{1, \ldots, n\}$. This collection has
cardinality $2^{n-1}$.

In this paper, a slightly smaller collection of models is considered, that is, all
$m \in \widetilde{\mathcal{M}}_n$ such that each element of the partition $(I_\lambda)_{\lambda \in \Lambda_m}$ contains at least two design
points $(t_j)_{1 \le j \le n}$. Indeed, when nothing is known about the noise-level $\sigma(\cdot)$, one cannot
hope to distinguish two consecutive change-points from a local variation of $\sigma$. For
every $D \in \{1, \ldots, n\}$, let $\mathcal{M}_n(D)$ denote the set of $m \in \widetilde{\mathcal{M}}_n(D)$ satisfying this property.
Then, the collection of models used in this paper is defined as $\{S_m\}_{m \in \mathcal{M}_n}$ where
$\mathcal{M}_n = \bigcup_{D \in \mathcal{D}_n} \mathcal{M}_n(D)$ and $\mathcal{D}_n \subset \{1, \ldots, n/2\}$. Finally, in all the experiments of the
paper, $\mathcal{D}_n = \{1, \ldots, 4n/10\}$ for reasons detailed in Section 4.2, in particular Remark 3.



2.4 Model selection 

Among $\{S_m\}_{m \in \mathcal{M}_n}$, the best model is defined as the minimizer of the quadratic loss
$\|s - \hat{s}_m\|_n^2$ over $m \in \mathcal{M}_n$ and called the oracle $m^\star$. Since the oracle depends on $s$, one can
only expect to select $\hat{m}(P_n)$ from the data such that the quadratic loss of $\hat{s}_{\hat{m}}$ is close to
that of the oracle with high probability, that is,

$$\|s - \hat{s}_{\hat{m}}\|_n^2 \le C \inf_{m \in \mathcal{M}_n} \left\{ \|s - \hat{s}_m\|_n^2 \right\} + R_n \tag{4}$$

where $C$ is close to 1 and $R_n$ is a small remainder term (typically of order $n^{-1}$). Inequality
(4) is called an oracle inequality.



3 Localization of the breakpoints 



A usual strategy for multiple change-point detection [28, 30] is to dissociate the search for
the best segmentation given the number of breakpoints from the choice of the number of
breakpoints.

In this section, the number $K = D - 1$ of breakpoints is fixed and the goal is to localize
them. In other words, the goal is to select a model among $\{S_m\}_{m \in \mathcal{M}_n(D)}$.

3.1 Empirical risk minimization's failure with heteroscedastic data 



As explained by many authors such as Lavielle [28], minimizing the least-squares criterion
over $\{\hat{s}_m\}_{m \in \mathcal{M}_n(D)}$ is a classical way of estimating the best segmentation with $(D-1)$
change-points. This leads to the following procedure:

Procedure 1.

$$\hat{m}_{\mathrm{ERM}}(D) := \arg\min_{m \in \mathcal{M}_n(D)} \{P_n\gamma(\hat{s}_m)\} = \mathrm{ERM}(S_D; P_n)$$

where $S_D := \bigcup_{m \in \mathcal{M}_n(D)} S_m$ is the set of piecewise constant functions with exactly $(D-1)$
change-points, chosen among $t_1, \ldots, t_n$ (see Section 2.3).

Figure 1: Comparison of $\hat{s}_{m^\star(D)}$ (dotted black line), $\hat{s}_{\hat{m}_{\mathrm{ERM}}(D)}$ (dashed blue line) and
$\hat{s}_{\hat{m}_{\mathrm{Loo}}(D)}$ (plain magenta line, see Section 3.2.2), $D$ being the "optimal" dimension (see
Figured]). Data are generated as described in Section 3.3.1 with $n = 100$ data points.
Left: homoscedastic data $(s_2, \sigma_c)$, $D = 4$. Right: heteroscedastic data $(s_3, \sigma_{pc,3})$, $D = 6$.



Remark 1. Dynamic programming [l_3j leads to an efficient implementation of Procedure 1
with computational complexity $O(n^2)$.
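
The dynamic-programming idea behind Remark 1 can be sketched as follows. This is a minimal illustration under our own conventions, not the paper's implementation: `best_segmentation` and the index-list encoding of a segmentation are assumptions, single-point segments are allowed, and the cost of this sketch is $O(D n^2)$ operations for a target of $D$ segments.

```python
def best_segmentation(y, d):
    """Minimize the least-squares criterion over segmentations of y into d segments.

    Returns (empirical_risk, bounds), where segment j is y[bounds[j]:bounds[j+1]].
    """
    n = len(y)
    # Prefix sums give the residual sum of squares of any segment in O(1).
    c1, c2 = [0.0], [0.0]
    for v in y:
        c1.append(c1[-1] + v)
        c2.append(c2[-1] + v * v)

    def cost(a, b):
        s = c1[b] - c1[a]
        return (c2[b] - c2[a]) - s * s / (b - a)

    inf = float("inf")
    # best[k][b]: minimal cost of cutting y[:b] into k segments.
    best = [[inf] * (n + 1) for _ in range(d + 1)]
    arg = [[0] * (n + 1) for _ in range(d + 1)]
    best[0][0] = 0.0
    for k in range(1, d + 1):
        for b in range(k, n + 1):
            for a in range(k - 1, b):
                c = best[k - 1][a] + cost(a, b)
                if c < best[k][b]:
                    best[k][b], arg[k][b] = c, a
    # Backtrack the optimal change-points from the right end.
    bounds = [n]
    for k in range(d, 0, -1):
        bounds.append(arg[k][bounds[-1]])
    return best[d][n] / n, bounds[::-1]


risk, bounds = best_segmentation([0.0, 0.0, 0.0, 5.0, 5.0, 5.0, 1.0, 1.0, 1.0], 3)
print(risk, bounds)
```

The recursion only requires the criterion to be a sum of per-segment terms, so another additive segment cost could be substituted for `cost`.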

Among models corresponding to segmentations with $(D-1)$ change-points, the oracle
model can be defined as

$$m^\star(D) := \arg\min_{m \in \mathcal{M}_n(D)} \left\{ \|s - \hat{s}_m\|_n^2 \right\} .$$

Figure 1 illustrates how far $\hat{m}_{\mathrm{ERM}}(D)$ typically is from $m^\star(D)$ according to variations of
the standard-deviation $\sigma$. On the one hand, when data are homoscedastic, empirical risk
minimization yields a segmentation close to the oracle (Figure 1, left). On the other hand,
when data are heteroscedastic, empirical risk minimization introduces artificial breakpoints
in areas where the noise-level is above average, and misses breakpoints in areas where the
noise-level is below average (Figure 1, right). In other words, when data are heteroscedastic,
empirical risk minimization over $S_D$ locally overfits in high-noise areas, and locally
underfits in low-noise areas.

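The failure of empirical risk minimization with heteroscedastic data observed on Figure 1
is general [21, Chapter 7] and can be explained by Lemma 1 below. Indeed, the criteria
$P_n\gamma(\hat{s}_m)$ and $\|s - \hat{s}_m\|_n^2$, respectively minimized by $\hat{m}_{\mathrm{ERM}}(D)$ and $m^\star(D)$ over $\mathcal{M}_n(D)$,
are close to their respective expectations, as proved by the concentration inequalities of
Proposition 9] for instance. Lemma 1 makes it possible to compare these expectations.

Lemma 1. Let $m \in \mathcal{M}_n$ and define $s_m := \arg\min_{f \in S_m} \|s - f\|_n^2$. Then,

$$E\left[P_n\gamma(\hat{s}_m)\right] = \|s - s_m\|_n^2 - V(m) + \frac{1}{n}\sum_{i=1}^n \sigma(t_i)^2 \tag{5}$$

$$E\left[\|s - \hat{s}_m\|_n^2\right] = \|s - s_m\|_n^2 + V(m) \tag{6}$$

where the variance term $V(m)$ is the sum over the segments of the average noise variance on each segment, divided by $n$:

$$V(m) := \frac{1}{n}\sum_{\lambda \in \Lambda_m} \frac{\sum_{i=1}^n \sigma(t_i)^2 \mathbf{1}_{t_i \in I_\lambda}}{\mathrm{Card}(\{k \mid t_k \in I_\lambda\})} .$$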



Lemma 1 is proved in [21]. As is well known in the model selection literature, the
expectation of the quadratic loss (6) is the sum of two terms: $\|s - s_m\|_n^2$ is the bias of
model $S_m$, and $V(m)$ is a variance term, measuring the difficulty of estimating the $D_m$
parameters of model $S_m$. Up to the term $n^{-1}\sum_{i=1}^n \sigma(t_i)^2$, which does not depend on $m$, the
empirical risk underestimates the quadratic risk (that is, the expectation of the quadratic
loss), as shown by (5), because of the sign in front of $V(m)$.

Nevertheless, when data are homoscedastic, that is, when $\forall i$, $\sigma(t_i) = \sigma$, $V(m) =
D_m \sigma^2 n^{-1}$ is the same for all $m \in \mathcal{M}_n(D)$. Therefore, (5) and (6) show that for every
$D \ge 1$, when data are homoscedastic,

$$\arg\min_{m \in \mathcal{M}_n(D)} \left\{ E\left[P_n\gamma(\hat{s}_m)\right] \right\} = \arg\min_{m \in \mathcal{M}_n(D)} \left\{ E\left[\|s - \hat{s}_m\|_n^2\right] \right\} .$$

Hence, $\hat{m}_{\mathrm{ERM}}(D)$ and $m^\star(D)$ tend to be close to one another, as on the left of Figure 1.

On the contrary, when data are heteroscedastic, the variance term $V(m)$ can be quite
different among models $m \in \mathcal{M}_n(D)$, even though they have the same dimension $D$.
Indeed, $V(m)$ increases when a breakpoint is moved from an area where $\sigma$ is small to an
area where $\sigma$ is large. Therefore, the empirical risk minimization algorithm rather puts
breakpoints in noisy areas in order to minimize $-V(m)$ in (5). This is illustrated in the
right panel of Figure 1, where the oracle segmentation $m^\star(D)$ has more breakpoints in
areas where $\sigma$ is small.
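
The effect of breakpoint location on the variance term can be checked numerically. The sketch below assumes the standard expression of the variance term for piecewise constant least-squares fits, namely $V(m) = n^{-1} \sum_{\lambda} \overline{\sigma^2}_\lambda$, where $\overline{\sigma^2}_\lambda$ is the average of $\sigma(t_i)^2$ over segment $\lambda$; this reduces to $D_m \sigma^2 / n$ for constant noise. The function name, the index-list encoding of segmentations and the toy noise profile are ours.

```python
def variance_term(sigma, bounds):
    """V(m): sum over segments of the mean noise variance, divided by n.

    sigma[i] is the noise standard deviation at design point i, and
    segment j covers indices bounds[j] .. bounds[j + 1] - 1.
    """
    n = len(sigma)
    per_segment = (sum(s * s for s in sigma[a:b]) / (b - a)
                   for a, b in zip(bounds[:-1], bounds[1:]))
    return sum(per_segment) / n


# Toy profile: noisy first half, quiet second half (hypothetical values).
sigma = [1.0] * 6 + [0.1] * 6
in_noisy_area = [0, 2, 4, 12]   # both breakpoints in the noisy half
in_quiet_area = [0, 8, 10, 12]  # both breakpoints in the quiet half
print(variance_term(sigma, in_noisy_area), variance_term(sigma, in_quiet_area))
```

Both segmentations have the same dimension, yet the one with breakpoints in the noisy half has the larger variance term, so the empirical risk, which carries $-V(m)$, is biased in its favour.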



3.2 Cross-validation 

Cross-validation (CV) methods are natural candidates for fixing the failure of empirical
risk minimization when data are heteroscedastic, since CV methods are naturally adaptive
to heteroscedasticity (see Section 1.3). The purpose of this section is to properly define
how CV can be used for selecting $m \in \mathcal{M}_n(D)$ (Procedure 2), and to recall theoretical
results showing why this procedure adapts to heteroscedasticity (Proposition 1).



3.2.1 Heuristics 

The cross-validation heuristics (3, H3] relies on a data splitting idea: for each candidate
algorithm, say $\mathrm{ERM}(S_m; \cdot)$ for some $m \in \mathcal{M}_n(D)$, part of the data (called the training
set) is used for training the algorithm. The remaining part (called the validation set) is used
for estimating the risk of the algorithm. This simple strategy is called validation or hold-out.
One can also split data several times and average the estimated values of the risk over
the splits. Such a strategy is called cross-validation (CV). CV with general repeated splits
of data has been introduced by Geisser [23, 24].

In the fixed-design setting, $(t_i, Y_i)_{1 \le i \le n}$ are not identically distributed so that CV
estimates a quantity slightly different from the usual prediction error. Let $T$ be uniformly
distributed over $\{t_1, \ldots, t_n\}$ and $Y = s(T) + \sigma(T)\epsilon$, where $\epsilon$ is independent from $\epsilon_1, \ldots, \epsilon_n$
with the same distribution. Then, the CV estimator of the risk of $\hat{s}(P_n)$ estimates

$$E_{(T,Y)}\left[(\hat{s}(T) - Y)^2\right] = \frac{1}{n}\sum_{i=1}^n E_{\epsilon}\left[\left(s(t_i) + \sigma(t_i)\epsilon - \hat{s}(t_i)\right)^2\right] = \|s - \hat{s}\|_n^2 + \frac{1}{n}\sum_{i=1}^n \sigma(t_i)^2 .$$

Hence, minimizing the CV estimator of $E_{(T,Y)}\left[(\hat{s}_m(T) - Y)^2\right]$ over $m$ amounts to minimizing
$\|s - \hat{s}_m\|_n^2$, up to estimation errors.

Even though the use of CV in a fixed-design setting is not usual, theoretical results
detailed in Section 3.2.4 below show that CV actually leads to a good estimator of the
quadratic risk $\|s - \hat{s}_m\|_n^2$. This fact is confirmed by all the experimental results of the
paper.



3.2.2 Definition 

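Let us now formally define how CV is used for selecting some $m \in \mathcal{M}_n(D)$ from data. A
(statistical) algorithm $A$ is defined as any measurable function $P_n \mapsto A(P_n) \in S^*$. For any
$t_i \in [0,1]$, $A(t_i; P_n)$ denotes the value of $A(P_n)$ at point $t_i$.
For any $I^{(t)} \subset \{1, \ldots, n\}$, define $I^{(v)} := \{1, \ldots, n\} \setminus I^{(t)}$,

$$P_n^{(t)} := \frac{1}{\mathrm{Card}(I^{(t)})}\sum_{i \in I^{(t)}} \delta_{(t_i, Y_i)} \quad \text{and} \quad P_n^{(v)} := \frac{1}{\mathrm{Card}(I^{(v)})}\sum_{i \in I^{(v)}} \delta_{(t_i, Y_i)} .$$

Then, the hold-out estimator of the risk of any algorithm $A$ is defined as

$$\hat{R}_{\mathrm{ho}}(A, P_n, I^{(t)}) := P_n^{(v)}\gamma\left(A(P_n^{(t)})\right) = \frac{1}{\mathrm{Card}(I^{(v)})}\sum_{i \in I^{(v)}} \left(A(t_i; P_n^{(t)}) - Y_i\right)^2 .$$

The cross-validation estimators of the risk of $A$ are then defined as the average of
$\hat{R}_{\mathrm{ho}}(A, P_n, I_j^{(t)})$ over $j = 1, \ldots, B$ where $I_1^{(t)}, \ldots, I_B^{(t)}$ are chosen in a predetermined way
[24]. Leave-one-out, leave-p-out and V-fold cross-validation are among the most classical
examples of CV procedures. They differ from one another in the choice of $I_1^{(t)}, \ldots, I_B^{(t)}$.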



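Leave-one-out (Loo), often called ordinary CV [1, 43], consists in training with the
whole sample except one point, used for testing, and repeating this for each data
point: $I_j^{(t)} = \{1, \ldots, n\} \setminus \{j\}$ for $j = 1, \ldots, n$. The Loo estimator of the risk of $A$ is
defined by

$$\hat{R}_{\mathrm{Loo}}(A, P_n) := \frac{1}{n}\sum_{j=1}^n \left(A(t_j; P_n^{(-j)}) - Y_j\right)^2 ,$$

where $P_n^{(-j)} := (n-1)^{-1}\sum_{i,\, i \ne j} \delta_{(t_i, Y_i)}$.

Leave-p-out (Lpo$_p$, with any $p \in \{1, \ldots, n-1\}$) generalizes Loo. Let $\mathcal{E}_p$ denote the
collection of all possible subsets of $\{1, \ldots, n\}$ with cardinality $n - p$. Then, Lpo
consists in considering every $I^{(t)} \in \mathcal{E}_p$ as training set indices:

$$\hat{R}_{\mathrm{Lpo}_p}(A, P_n) := \binom{n}{p}^{-1} \sum_{I^{(t)} \in \mathcal{E}_p} \left[ \frac{1}{p}\sum_{j \in I^{(v)}} \left(A(t_j; P_n^{(t)}) - Y_j\right)^2 \right] . \tag{8}$$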



V-fold cross-validation (VFCV) is a computationally efficient alternative to Lpo and
Loo. The idea is to first partition the data into $V$ blocks, to use all the data but
one block as a training sample, and to repeat the process $V$ times. In other words,
VFCV is a blockwise Loo, so that its computational complexity is $V$ times that
of $A$. Formally, let $B_1, \ldots, B_V$ be a partition of $\{1, \ldots, n\}$ and
$P_n^{(-B_k)} := (n - \mathrm{Card}(B_k))^{-1}\sum_{i \notin B_k} \delta_{(t_i, Y_i)}$ for every $k \in \{1, \ldots, V\}$. The VFCV estimator of the
risk of $A$ is defined by

$$\hat{R}_{\mathrm{VF}_V}(A, P_n) := \frac{1}{V}\sum_{k=1}^V \left[ \frac{1}{\mathrm{Card}(B_k)}\sum_{j \in B_k} \left(A(t_j; P_n^{(-B_k)}) - Y_j\right)^2 \right] . \tag{9}$$

The interested reader will find theoretical and experimental results on VFCV and
the best way to use it in @, [ED] and references therein, in particular [18].
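
The V-fold estimator can be sketched for a piecewise constant least-squares fit as follows. The block design used here, $B_k = \{j : j \bmod V = k\}$, is our assumption for illustration (the text does not fix the blocks); with this interleaved design, every segment containing at least two points keeps at least one training point in each fold as soon as $V \ge 2$. The function name and the index-list encoding of segmentations are also ours.

```python
def vfold_risk(y, bounds, v):
    """V-fold CV estimate of the risk of the segment-means fit on `bounds`.

    Blocks are interleaved (index j goes to block j % v); this requires
    2 <= v <= len(y) and at least two points in every segment.
    """
    n = len(y)
    total = 0.0
    for k in range(v):
        block_size = sum(1 for j in range(n) if j % v == k)
        err = 0.0
        for a, b in zip(bounds[:-1], bounds[1:]):
            # Train on the points of the segment outside block k.
            train = [y[i] for i in range(a, b) if i % v != k]
            mean = sum(train) / len(train)
            # Validate on the points of the segment inside block k.
            err += sum((y[j] - mean) ** 2 for j in range(a, b) if j % v == k)
        total += err / block_size
    return total / v


y = [1.0, 0.0, 0.0, 0.0]
print(vfold_risk(y, [0, 4], 2))  # 2-fold
print(vfold_risk(y, [0, 4], 4))  # v = n coincides with leave-one-out
```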



Given the Loo estimator of the risk of each algorithm $A$ among $\{\mathrm{ERM}(S_m; \cdot)\}_{m \in \mathcal{M}_n(D)}$,
the segmentation with $(D-1)$ breakpoints chosen by Loo is defined as follows.

Procedure 2.

$$\hat{m}_{\mathrm{Loo}}(D) := \arg\min_{m \in \mathcal{M}_n(D)} \left\{ \hat{R}_{\mathrm{Loo}}\left(\mathrm{ERM}(S_m; \cdot), P_n\right) \right\} .$$

The segmentations chosen by Lpo and VFCV are defined similarly and denoted respectively
by $\hat{m}_{\mathrm{Lpo}_p}(D)$ and by $\hat{m}_{\mathrm{VF}_V}(D)$.

As illustrated by Figure 1, when data are heteroscedastic, $\hat{m}_{\mathrm{Loo}}(D)$ is often closer to
the oracle segmentation $m^\star(D)$ than $\hat{m}_{\mathrm{ERM}}(D)$. This improvement will be explained by
theoretical results in Section 3.2.4 below.



3.2.3 Computational tractability 

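The computational complexity of $\mathrm{ERM}(S_m; P_n)$ is $O(n)$ since for every $\lambda \in \Lambda_m$, the value
of $\hat{s}_m(P_n)$ on $I_\lambda$ is equal to the mean of $\{Y_i\}_{t_i \in I_\lambda}$. Therefore, a naive implementation of
Lpo$_p$ has a computational complexity $O\left(n \binom{n}{p}\right)$, which can be intractable for large $n$ in
the context of model selection, even when $p = 1$. In such cases, only VFCV with a small
$V$ would work straightforwardly, since its computational complexity is $O(nV)$.

Nevertheless, closed-form formulas for the Lpo estimator of the risk have been derived
in the density estimation 20, Oil] and regression [21] frameworks. Some of these closed-form
formulas apply to regressograms $\hat{s}_m$ with $m \in \mathcal{M}_n$. The following theorem gives a
closed-form expression for $\hat{R}_{\mathrm{Lpo}_p}(m) := \hat{R}_{\mathrm{Lpo}_p}(\mathrm{ERM}(S_m; \cdot), P_n)$ which can be computed
with $O(n)$ elementary operations.

Theorem 1 (Corollary 3.3.2 in [21]). Let $m \in \mathcal{M}_n$, $S_m$ and $\hat{s}_m = \mathrm{ERM}(S_m; \cdot)$ be defined
as in Section 2. For every $(t_1, Y_1), \ldots, (t_n, Y_n) \in \mathbb{R}^2$ and $\lambda \in \Lambda_m$, define

$$S_{\lambda,1} := \sum_{j=1}^n Y_j \mathbf{1}_{\{t_j \in I_\lambda\}} \quad \text{and} \quad S_{\lambda,2} := \sum_{j=1}^n Y_j^2 \mathbf{1}_{\{t_j \in I_\lambda\}} .$$

Then, for every $p \in \{1, \ldots, n-1\}$, the Lpo$_p$ estimator of the risk of $\hat{s}_m$ defined by (8) is
given by

$$\hat{R}_{\mathrm{Lpo}_p}(m) = \sum_{\lambda \in \Lambda_m} \frac{1}{p N_\lambda} \left[ \left\{ (A_\lambda - B_\lambda) S_{\lambda,2} + B_\lambda S_{\lambda,1}^2 \right\} \mathbf{1}_{\{n_\lambda \ge 2\}} + \{+\infty\} \mathbf{1}_{\{n_\lambda = 1\}} \right] ,$$

where for every $\lambda \in \Lambda_m$,

$$n_\lambda := \mathrm{Card}(\{i \mid t_i \in I_\lambda\}) , \qquad N_\lambda := 1 - \mathbf{1}_{\{p \ge n_\lambda\}} \binom{n - n_\lambda}{p - n_\lambda} \Big/ \binom{n}{p} ,$$

$$A_\lambda := V_\lambda(0)\left(1 - \frac{1}{n_\lambda}\right) - \frac{V_\lambda(1)}{n_\lambda} + V_\lambda(-1) ,$$

$$B_\lambda := V_\lambda(1)\frac{2 - \mathbf{1}_{n_\lambda \ge 3}}{n_\lambda(n_\lambda - 1)} + \frac{V_\lambda(0)}{n_\lambda - 1}\left[\left(1 + \frac{1}{n_\lambda}\right)\mathbf{1}_{n_\lambda \ge 3} - 2\right] - \frac{\mathbf{1}_{n_\lambda \ge 3}}{n_\lambda - 1} V_\lambda(-1) ,$$

$$\text{and } \forall k \in \{-1, 0, 1\}, \quad V_\lambda(k) := \sum_{r = \max\{1, (n_\lambda - p)\}}^{\min\{n_\lambda, (n - p)\}} r^k \, \frac{\binom{n-p}{r}\binom{p}{n_\lambda - r}}{\binom{n}{n_\lambda}} .$$

Remark 2. $V_\lambda(k)$ can also be written as $E\left[Z^k \mathbf{1}_{Z > 0}\right]$ where $Z$ has hypergeometric distribution
with parameters $(n, n-p, n_\lambda)$.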
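
A closed-form Lpo estimator of this kind can be checked numerically against the brute-force definition. The sketch below is ours and rests on stated assumptions: $V_\lambda(k)$ is computed as $E[Z^k \mathbf{1}_{Z>0}]$ for a hypergeometric $Z$ counting the training points falling in a segment (as in Remark 2), segmentations are encoded as index lists, and the brute-force comparison is only meaningful when $p$ is smaller than every segment size, so that no training segment is ever empty.

```python
from itertools import combinations
from math import comb, inf


def v_lambda(k, n, p, n_lam):
    """E[Z^k 1{Z>0}], Z = number of training points (n - p in total) in the segment."""
    lo, hi = max(1, n_lam - p), min(n_lam, n - p)
    total = sum(r ** k * comb(n - p, r) * comb(p, n_lam - r)
                for r in range(lo, hi + 1))
    return total / comb(n, n_lam)


def lpo_closed_form(y, bounds, p):
    """Closed-form leave-p-out risk of the segment-means fit on `bounds`."""
    n, risk = len(y), 0.0
    for a, b in zip(bounds[:-1], bounds[1:]):
        n_lam = b - a
        if n_lam < 2:
            return inf  # one-point segments get an infinite risk
        s1 = sum(y[a:b])
        s2 = sum(v * v for v in y[a:b])
        v0, v1, vm1 = (v_lambda(k, n, p, n_lam) for k in (0, 1, -1))
        ge3 = 1.0 if n_lam >= 3 else 0.0
        a_lam = v0 * (1 - 1 / n_lam) - v1 / n_lam + vm1
        b_lam = (v1 * (2 - ge3) / (n_lam * (n_lam - 1))
                 + v0 / (n_lam - 1) * ((1 + 1 / n_lam) * ge3 - 2)
                 - ge3 / (n_lam - 1) * vm1)
        p_empty = comb(n - n_lam, p - n_lam) / comb(n, p) if p >= n_lam else 0.0
        risk += ((a_lam - b_lam) * s2 + b_lam * s1 * s1) / (p * (1.0 - p_empty))
    return risk


def lpo_brute_force(y, bounds, p):
    """Average hold-out error over all C(n, p) validation sets (p < segment sizes)."""
    n = len(y)
    seg = [j for j, (a, b) in enumerate(zip(bounds[:-1], bounds[1:]))
           for _ in range(a, b)]
    total = 0.0
    for val in combinations(range(n), p):
        held, sums, counts = set(val), {}, {}
        for i in range(n):
            if i not in held:
                sums[seg[i]] = sums.get(seg[i], 0.0) + y[i]
                counts[seg[i]] = counts.get(seg[i], 0) + 1
        total += sum((y[j] - sums[seg[j]] / counts[seg[j]]) ** 2 for j in val) / p
    return total / comb(n, p)


y = [0.3, -1.2, 0.8, 2.1, 1.7, 2.9, 2.2, 1.5]
for p in (1, 2):
    print(p, lpo_closed_form(y, [0, 3, 8], p), lpo_brute_force(y, [0, 3, 8], p))
```

The closed form touches each segment once, whereas the brute-force version enumerates all $\binom{n}{p}$ splits, which is only feasible for very small samples.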

An important practical consequence of Theorem 1 is that for every $D$ and $p$, $\hat{m}_{\mathrm{Lpo}_p}(D)$
can be computed with the same computational complexity as $\hat{m}_{\mathrm{ERM}}(D)$, that is $O(n^2)$.
Indeed, Theorem 1 shows that $\hat{R}_{\mathrm{Lpo}_p}(m)$ is a sum over $\lambda \in \Lambda_m$ of terms depending only
on $\{Y_i\}_{t_i \in I_\lambda}$, so that dynamic programming [T^] can be used for computing the minimizer
$\hat{m}_{\mathrm{Lpo}_p}(D)$ of $\hat{R}_{\mathrm{Lpo}_p}(m)$ over $m \in \mathcal{M}_n(D)$. Therefore, Lpo and Loo are computationally
tractable for change-point detection when the number of breakpoints is given.

Dynamic programming also applies to $\hat{m}_{\mathrm{VF}_V}(D)$ with a computational complexity
$O(Vn^2)$, since each term appearing in $\hat{R}_{\mathrm{VF}_V}(m)$ is the average over $V$ quantities that
must be computed, except when $V = n$ since VFCV then becomes Loo. Since VFCV is
mostly an approximation to Loo or Lpo but has a larger computational complexity, $\hat{m}_{\mathrm{Lpo}_p}(D)$
will be preferred to $\hat{m}_{\mathrm{VF}_V}(D)$ in the following.

3.2.4 Theoretical guarantees 

In order to understand why CV indeed works for change-point detection with a given
number of breakpoints, let us recall a straightforward consequence of Theorem 1 which is
proved in detail in [21, Lemma 7.2.1 and Proposition 7.2.3].

Proposition 1. Using the notation of Lemma 1, for any $m \in \mathcal{M}_n$,

$$E\left[\hat{R}_{\mathrm{Lpo}_p}(m)\right] \approx \|s - s_m\|_n^2 + \frac{n}{n-p} V(m) + \frac{1}{n}\sum_{i=1}^n \sigma(t_i)^2 \tag{10}$$

where the approximation holds as soon as $\min_{\lambda \in \Lambda_m} n_\lambda$ is large enough (in particular larger
than $p$).

Figure 2: Regression functions $s_1$, $s_2$, $s_3$; $s_1$ and $s_2$ are piecewise constant with 4 jumps; $s_3$
is piecewise constant with 9 jumps.

The comparison of (6) and (10) shows that Lpo$_p$ yields an almost unbiased estimator
of $\|s - \hat{s}_m\|_n^2$, up to the additive term $n^{-1}\sum_{i=1}^n \sigma(t_i)^2$ which does not depend on $m$: the only
difference is that the factor $1/n$ in front of the variance term $V(m)$
has been changed into $1/(n-p)$. Therefore, minimizing the Lpo$_p$ estimator of the risk
instead of the empirical risk automatically takes heteroscedasticity
of data into account.

3.3 Simulation study 

The goal of this section is to experimentally assess, for several values of p, the performance 
of Lpo p for detecting a given number of changes in the mean of a heteroscedastic signal. 
This performance is also compared with that of empirical risk minimization. 

3.3.1 Setting 

The setting described in this section is used in all the experiments of the paper. 

Data are generated according to (3) with $n = 100$. For every $i$, $t_i = i/n$ and $\epsilon_i$
has a standard Gaussian distribution. The regression function $s$ is chosen among three
piecewise constant functions $s_1, s_2, s_3$ plotted on Figure 2. The model collection described
in Section 2.3 is used with $\mathcal{D}_n = \{1, \ldots, 4n/10\}$. The noise-level function $\sigma(\cdot)$ is chosen
among the following functions:

1. Homoscedastic noise: $\sigma_c = 0.25 \, \mathbf{1}_{[0,1]}$,

2. Heteroscedastic piecewise constant noise: $\sigma_{pc,1} = 0.2 \, \mathbf{1}_{[0,1/3]} + 0.05 \, \mathbf{1}_{(1/3,1]}$, $\sigma_{pc,2} =
2 \sigma_{pc,1}$ or $\sigma_{pc,3} = 2.5 \sigma_{pc,1}$.

3. Heteroscedastic sinusoidal noise: $\sigma_s = 0.5 \sin(t \pi / 4)$.
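
The setting above can be sketched in a few lines. The code assumes that the data follow $Y_i = s(t_i) + \sigma(t_i) \epsilon_i$ with the noise-level functions read as in the list above; the regression function `s_example` is purely hypothetical (the actual $s_1, s_2, s_3$ are only shown in Figure 2), and all function names are illustrative.

```python
import numpy as np

def sigma_c(t):
    return 0.25 * np.ones_like(t)

def sigma_pc1(t):
    return np.where(t <= 1 / 3, 0.2, 0.05)

def sigma_pc2(t):
    return 2.0 * sigma_pc1(t)

def sigma_pc3(t):
    return 2.5 * sigma_pc1(t)

def sigma_s(t):
    return 0.5 * np.sin(t * np.pi / 4)

def s_example(t):
    # Hypothetical piecewise constant mean with a single jump; NOT one of s_1, s_2, s_3.
    return np.where(t <= 0.5, 0.0, 1.0)

def generate(s, sigma, n=100, rng=None):
    """Draw (t_i, Y_i) with t_i = i/n and Y_i = s(t_i) + sigma(t_i) * eps_i (assumed model)."""
    rng = np.random.default_rng() if rng is None else rng
    t = np.arange(1, n + 1) / n
    eps = rng.standard_normal(n)
    return t, s(t) + sigma(t) * eps
```

For instance, `generate(s_example, sigma_pc3, rng=np.random.default_rng(0))` returns one heteroscedastic sample of size 100.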

All combinations between the regression functions $(s_i)_{i=1,2,3}$ and the five noise-levels
$\sigma_\cdot$ have been considered, each time with $N = 10\,000$ independent samples. Results below
only report a small part of the entire simulation study but intend to be representative
of the main observed behaviour. A more complete report of the results, including other
regression functions $s$ and noise-level functions $\sigma$, is given in the second author's thesis [21,
Chapter 7]; see also Section 3 of the supplementary material.

Figure 3: $\mathbb{E}\big[ \| s - \hat{s}_{\widehat{m}_{\mathfrak{P}}(D)} \|_n^2 \big]$ as a function of the dimension $D$ for $\mathfrak{P}$ among 'ERM' (empirical risk
minimization), 'Loo' (Leave-one-out), 'Lpo(20)' (Lpo$_p$ with $p = 20$) and 'Lpo(50)' (Lpo$_p$
with $p = 50$). Left: homoscedastic $(s_2, \sigma_c)$. Right: heteroscedastic $(s_3, \sigma_{pc,3})$. All curves
have been estimated from $N = 10\,000$ independent samples; error bars are all negligible compared to
visible differences (the largest ones are smaller than $8 \cdot 10^{-5}$ on the left, and smaller than $2 \cdot 10^{-4}$ on
the right). The curves $D \mapsto \| s - \hat{s}_{\widehat{m}_{\mathfrak{P}}(D)} \|_n^2$ behave similarly to their expectations.



3.3.2 Results: Comparison of segmentations for each dimension 

The segmentations of each dimension $D \in \mathcal{D}_n$ obtained by empirical risk minimization
('ERM', Procedure 1) and Lpo$_p$ (Procedure 2) for several values of $p$ are compared on
Figure 3 through the expected values of the quadratic loss $\mathbb{E}\big[ \| s - \hat{s}_{\widehat{m}_{\mathfrak{P}}(D)} \|_n^2 \big]$ for each
procedure $\mathfrak{P}$.

On the one hand, when data are homoscedastic (Figure 3 left), all procedures yield
similar performances for all dimensions up to twice the best dimension; Lpo$_p$ performs
significantly better for larger dimensions. Therefore, unless the dimension is strongly
overestimated (whatever the way $D$ is chosen), all procedures are equivalent with homoscedastic
data.

On the other hand, when data are heteroscedastic (Figure 3 right), ERM yields
significantly worse performance than Lpo for dimensions larger than half the true dimension. As
explained in Sections 3.1 and 3.2.4, $\widehat{m}_{\mathrm{ERM}}(D)$ often puts breakpoints inside pure noise for
dimensions $D$ smaller than the true dimension, whereas Lpo does not have this drawback.
Therefore, whatever the choice of the dimension (except $D \leq 4$, that is for detecting the
obvious jumps), Lpo should be preferred to empirical risk minimization as soon as data are
heteroscedastic.
  s     σ       ERM            Loo            Lpo_20         Lpo_50
  s2    c       2.88 ± 0.01    2.93 ± 0.01    2.93 ± 0.01    2.94 ± 0.01
        pc,1    1.31 ± 0.02    1.16 ± 0.02    1.14 ± 0.02    1.11 ± 0.01
        pc,3    3.09 ± 0.03    2.52 ± 0.03    2.48 ± 0.03    2.32 ± 0.03
  s3    c       3.18 ± 0.01    3.25 ± 0.01    3.29 ± 0.01    3.44 ± 0.01
        pc,1    3.00 ± 0.01    2.67 ± 0.02    2.68 ± 0.02    2.77 ± 0.02
        pc,3    4.41 ± 0.02    3.97 ± 0.02    4.00 ± 0.02    4.11 ± 0.02

Table 1: Average performance $C_{or}([\mathfrak{P}, \mathrm{Id}])$ for change-point detection procedures $\mathfrak{P}$ among
ERM, Loo and Lpo$_p$ with $p = 20$ and $p = 50$. Several regression functions $s$ and noise-level
functions $\sigma$ have been considered, each time with $N = 10\,000$ independent samples. Next
to each value is indicated the corresponding empirical standard deviation divided by $\sqrt{N}$,
measuring the uncertainty of the estimated performance.



3.3.3 Results: Comparison of the "best" segmentations 



This section focuses on the segmentation obtained with the best possible choice of $D$, that
is the one corresponding to the minimum of $D \mapsto \| s - \hat{s}_{\widehat{m}_{\mathfrak{P}}(D)} \|_n^2$ (plotted on Figure 3)
for procedures $\mathfrak{P}$ among ERM, Loo, and Lpo$_p$ with $p = 20$ and $p = 50$. Therefore, the
performance of a procedure is defined by

\[ C_{or}([\mathfrak{P}, \mathrm{Id}]) := \frac{\mathbb{E}\big[ \inf_{1 \leq D \leq n} \| s - \hat{s}_{\widehat{m}_{\mathfrak{P}}(D)} \|_n^2 \big]}
   {\mathbb{E}\big[ \inf_{m \in \mathcal{M}_n} \| s - \hat{s}_m \|_n^2 \big]} , \]

which measures what is lost compared to the oracle when selecting one segmentation
$\widehat{m}_{\mathfrak{P}}(D)$ per dimension. Even if the choice of $D$ is a real practical problem, which will
be tackled in the next sections, $C_{or}([\mathfrak{P}, \mathrm{Id}])$ helps to understand which is the best
procedure for selecting a segmentation of a given dimension. The notation $C_{or}([\mathfrak{P}, \mathrm{Id}])$
has been chosen for consistency with notation used in the next sections (see Section 5.1).

Table 1 confirms the results of Section 3.3.2. On the one hand, when data are
homoscedastic, ERM performs slightly better than Loo or Lpo$_p$. On the other hand, when
data are heteroscedastic, Lpo$_p$ often performs better than ERM (whatever $p$), and the
improvement can be large (more than 20% in the setting $(s_2, \sigma_{pc,3})$). Overall, when
homoscedasticity of the signal is questionable, Lpo$_p$ appears much more reliable than ERM
for localizing a given number of change-points of the mean.

The question of choosing $p$ for optimizing the performance of Lpo$_p$ remains a widely
open problem. The simulation experiment summarized in Table 1 only shows that, with
heteroscedastic data, Lpo$_p$ improves on ERM whatever $p$, the best value of $p$ depending on $s$ and $\sigma$.



4 Estimation of the number of breakpoints 

In this section, the number of breakpoints is no longer fixed or known a priori. The goal 
is precisely the estimation of this number, as often needed with real data. 

Two main procedures are considered. First, a penalization procedure introduced by
Birgé and Massart [15] is analyzed in Section 4.1; this procedure is successful for
change-point detection when data are homoscedastic [28, 30]. On the basis of this analysis,
V-fold cross-validation (VFCV) is then proposed as an alternative to Birgé and Massart's
penalization procedure (BM) when data can be heteroscedastic.

In order to enable the comparison between BM and VFCV when focusing on the
question of choosing the number of breakpoints, VFCV is used for choosing among the same
segmentations as BM, that is $\{\widehat{m}_{\mathrm{ERM}}(D)\}_{D \in \mathcal{D}_n}$. The combination of VFCV for choosing
$D$ with the new procedures proposed in Section 3 will be studied in Section 5.

4.1 Birgé and Massart's penalization

First, let us define precisely the penalization procedure proposed by Birgé and Massart
[15], successfully used for change-point detection in [28, 30].

Procedure 3 (Birgé and Massart [15]).

1. $\forall m \in \mathcal{M}_n$, $\hat{s}_m := \mathrm{ERM}(S_m; P_n)$.

2. $\widehat{m}_{BM} := \arg\min_{m \in \mathcal{M}_n,\, D_m \in \mathcal{D}_n} \{ P_n \gamma(\hat{s}_m) + \mathrm{pen}_{BM}(m) \}$, where for every $m \in \mathcal{M}_n$
the penalty $\mathrm{pen}_{BM}(m)$ only depends on $S_m$ through its dimension:

\[ \mathrm{pen}_{BM}(m) = \mathrm{pen}_{BM}(D_m) := \frac{\widehat{C} D_m}{n} \Big( 5 + 2 \log\Big( \frac{n}{D_m} \Big) \Big) , \qquad (11) \]

where $\widehat{C}$ is estimated from data using Birgé and Massart's slope heuristics [16, 8], as
proposed by Lebarbier [30] and by Lavielle [28]. See Section 1 of the supplementary
material for a detailed discussion about $\widehat{C}$.

3. $\tilde{s}_{BM} := \hat{s}_{\widehat{m}_{BM}}$.

All $m \in \mathcal{M}_n(D)$ are penalized in the same way by $\mathrm{pen}_{BM}(m)$, so that Procedure 3
actually selects a segmentation among $\{\widehat{m}_{\mathrm{ERM}}(D)\}_{D \in \mathcal{D}_n}$. Therefore, Procedure 3 can be
reformulated as follows, as noticed in [16, Section 4.3].

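Procedure 4 (Reformulation of Procedure 3).

1. $\forall D \in \mathcal{D}_n$, $\hat{s}_{\widehat{m}_{\mathrm{ERM}}(D)} := \mathrm{ERM}(S_D; P_n)$ where $S_D := \bigcup_{m \in \mathcal{M}_n(D)} S_m$.

2. $\widehat{D}_{BM} := \arg\min_{D \in \mathcal{D}_n} \{ P_n \gamma(\hat{s}_{\widehat{m}_{\mathrm{ERM}}(D)}) + \mathrm{pen}_{BM}(D) \}$ where $\mathrm{pen}_{BM}(D)$ is defined by
(11).

3. $\tilde{s}_{BM} := \hat{s}_{\widehat{m}_{\mathrm{ERM}}(\widehat{D}_{BM})}$.

In the following, 'BM' denotes Procedure 4 and

\[ \mathrm{crit}_{BM}(D) := P_n \gamma(\hat{s}_{\widehat{m}_{\mathrm{ERM}}(D)}) + \mathrm{pen}_{BM}(D) \]

is called the BM criterion.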
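
The second step of Procedure 4 can be sketched as below. This is an illustration under two assumptions: the penalty is taken to be $\widehat{C} D (5 + 2 \log(n/D)) / n$, and the constant $\widehat{C}$ is simply passed as an argument `c_hat` (the slope heuristics used to calibrate it is not implemented here). The function names are hypothetical.

```python
import numpy as np

def pen_bm(d, n, c_hat):
    """Assumed form of the BM penalty: c_hat * D / n * (5 + 2 log(n / D))."""
    d = np.asarray(d, dtype=float)
    return c_hat * d / n * (5.0 + 2.0 * np.log(n / d))

def select_dimension_bm(emp_risk, n, c_hat):
    """emp_risk[D-1] is the empirical risk of the ERM segmentation of dimension D,
    for D = 1..len(emp_risk). Returns the selected dimension and the BM criterion."""
    dims = np.arange(1, len(emp_risk) + 1)
    crit = np.asarray(emp_risk, dtype=float) + pen_bm(dims, n, c_hat)
    return int(dims[np.argmin(crit)]), crit
```

The empirical risks fed to `select_dimension_bm` would typically come from a dynamic-programming search over segmentations of each dimension.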

Procedure 4 clarifies the reason why $\mathrm{pen}_{BM}$ must be larger than Mallows' $C_p$ penalty.
Indeed, for every $m \in \mathcal{M}_n$, Lemma 1 shows that when data are homoscedastic, $P_n \gamma(\hat{s}_m) +
\mathrm{pen}(m)$ is an unbiased estimator of $\| s - \hat{s}_m \|_n^2$ when $\mathrm{pen}(m) = 2 \sigma^2 D_m n^{-1}$, that is Mallows'
$C_p$ penalty. When $\mathrm{Card}(\mathcal{M}_n)$ is at most polynomial in $n$, Mallows' $C_p$ penalty leads to an
efficient model selection procedure, as proved in several regression frameworks [41, 31, 10].
Hence, Mallows' $C_p$ penalty is an adequate measure of the "capacity" of any vector space
$S_m$ of dimension $D_m$, at least when data are homoscedastic.

Figure 4: Comparison of the expectations of $\| s - \hat{s}_{\widehat{m}(D)} \|_n^2$ ('Loss'), $\mathrm{crit}_{\mathrm{VF}_V}(D)$ ('VF5')
and $\mathrm{crit}_{BM}(D)$ ('BM'). Data are generated as explained in Section 3.3.1. Left:
homoscedastic $(s_2, \sigma_c)$. Right: heteroscedastic $(s_2, \sigma_{pc,3})$. Expectations have been estimated
from $N = 10\,000$ independent samples; error bars are all negligible compared to visible differences
(the largest ones are smaller than $5 \cdot 10^{-4}$ on the left, and smaller than $2 \cdot 10^{-3}$ on the right). Similar
behaviours are observed for every single sample, with slightly larger fluctuations for $\mathrm{crit}_{\mathrm{VF}_V}(D)$
than for $\mathrm{crit}_{BM}(D)$. The curves 'BM' and 'VF5' have been shifted in order to make comparison
with 'Loss' easier, without changing the location of the minimum.

On the contrary, in the change-point detection framework, $\mathrm{Card}(\mathcal{M}_n)$ grows
exponentially with $n$. The formulation of Procedure 4 points out that $\mathrm{pen}_{BM}(D)$ has been built
so that $\mathrm{crit}_{BM}(D)$ estimates unbiasedly $\| s - \hat{s}_{\widehat{m}_{\mathrm{ERM}}(D)} \|_n^2$ for every $D$, where $\hat{s}_{\widehat{m}_{\mathrm{ERM}}(D)}$ is
the empirical risk minimizer over $S_D$. Hence, $\mathrm{pen}_{BM}(D)$ measures the "capacity" of $S_D$,
which is much bigger than a vector space of dimension $D$. Therefore, $\mathrm{pen}_{BM}$ should be
larger than Mallows' $C_p$, as confirmed by the results of Birgé and Massart [16] on minimal
penalties for exponential collections of models.

Simulation experiments support the fact that $\mathrm{crit}_{BM}(D)$ is an unbiased estimator of
$\| s - \hat{s}_{\widehat{m}(D)} \|_n^2$ for every $D$ (up to an additive constant) when data are homoscedastic
(Figure 4 left). However, when data are heteroscedastic, theoretical results proved by
Birgé and Massart [15, 16] no longer apply, and simulations show that $\mathrm{crit}_{BM}(D)$ does
not always estimate $\| s - \hat{s}_{\widehat{m}_{\mathrm{ERM}}(D)} \|_n^2$ well (Figure 4 right). This result is consistent with
Lemma 1, as well as with the suboptimality of penalties proportional to $D_m$ for model selection
among a polynomial collection of models when data are heteroscedastic.

Therefore, $\mathrm{pen}_{BM}(D)$ is not an adequate capacity measure of $S_D$ in general when data
are heteroscedastic, and another capacity measure is required.

4.2 Cross-validation 

As shown in Section 3.2.2, CV can be used for estimating the quadratic loss $\| s - \mathcal{A}(P_n) \|_n^2$
for any algorithm $\mathcal{A}$. In particular, CV was successfully used in Section 3 for estimating
the quadratic risk of $\mathrm{ERM}(S_m; \cdot)$ for all segmentations $m \in \mathcal{M}_n(D)$ with a given number
$(D - 1)$ of breakpoints (Procedure 2), even when data are heteroscedastic.

Therefore, CV methods are natural candidates for fixing BM's failure. The proposed
procedure, with VFCV, is the following.

Procedure 5.

1. $\forall D \in \mathcal{D}_n$, $\hat{s}_{\widehat{m}_{\mathrm{ERM}}(D)} := \mathrm{ERM}(S_D; P_n)$,

2. $\widehat{D}_{\mathrm{VF}_V} := \arg\min_{D \in \mathcal{D}_n} \{ \mathrm{crit}_{\mathrm{VF}_V}(D) \}$

where $\mathrm{crit}_{\mathrm{VF}_V}(D) := \widehat{\mathcal{R}}_{\mathrm{VF}_V}(\mathrm{ERM}(S_D; \cdot), P_n)$, and $\widehat{\mathcal{R}}_{\mathrm{VF}_V}$ is defined by (9).

Remark 3. In the algorithm $(t_i, Y_i)_{1 \leq i \leq n} \mapsto \mathrm{ERM}(S_D; P_n)$, the model $S_D$ depends on the
design points. When the training set is $(t_i, Y_i)_{i \notin B_k}$, the model $S_D$ is the union of the
$S_m$ such that $\forall \lambda \in \Lambda_m$, $I_\lambda$ contains at least two elements of $\{ t_i \text{ s.t. } i \notin B_k \}$. Such an $m$
exists as soon as $D \leq (n - \max_k \{\mathrm{Card}(B_k)\})/2$ and two consecutive design points $t_i, t_{i+1}$
always belong to different blocks $B_k$, which is always assumed in this paper. Note that the
dynamic programming algorithms [13] quoted in Section 3.2.3 can straightforwardly take
into account such constraints when minimizing the empirical risk over $S_D$.
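
One simple way to satisfy the requirement of Remark 3 is to interleave the blocks. The sketch below is an assumption (the text does not say how the blocks $B_k$ are built beyond this constraint); it uses 0-based indices and hypothetical function names, and checks the dimension bound of Remark 3.

```python
def interleaved_blocks(n, v):
    """Blocks B_k = {i : i mod v == k}, k = 0..v-1 (assumed construction)."""
    return [list(range(k, n, v)) for k in range(v)]

def consecutive_points_separated(n, blocks):
    """True if two consecutive design points never fall in the same block."""
    label = {i: k for k, block in enumerate(blocks) for i in block}
    return all(label[i] != label[i + 1] for i in range(n - 1))

def max_dimension(n, blocks):
    """Bound of Remark 3: D <= (n - max_k Card(B_k)) / 2."""
    return (n - max(len(block) for block in blocks)) // 2
```

With $n = 100$ and $V = 5$, the bound equals 40, which matches the largest dimension $4n/10$ used in the simulation setting.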

The dependence of $S_D$ on the design explains why $\mathrm{crit}_{\mathrm{VF}_V}(D)$ decreases for $D$ close to
$n(V - 1)/(2V)$, as observed on Figure 4. Indeed, when $D$ is close to $n_t/2$ (where $n_t$ is the
size of the design), only a few $\{S_m\}_{m \in \mathcal{M}_n}$ remain in $S_D$; for instance, when $D = n_t/2$,
$S_D$ is equal to one of the $\{S_m\}_{m \in \mathcal{M}_n(D)}$. Therefore, the "capacity" of $S_D$ decreases in the
neighborhood of $D = n_t/2$.

Similar procedures can be defined with Loo and Lpo$_p$ instead of VFCV. The interest
of VFCV is its reasonably small computational cost (taking $V \leq 10$ for instance), since
no closed-form formula exists for CV estimators of the risk of $\mathrm{ERM}(S_D; P_n)$.



4.3 Simulation results 



A simulation experiment was performed in the setting presented in Section 3.3.1 for
comparing BM and VF$_V$ with $V = 5$ blocks. A representative picture of the results is given by
Figure 4 and by Table 2 [see 21, Chapter 7, and Section 3 of the supplementary material
for additional results].

As illustrated by Figure 4, $\mathrm{crit}_{\mathrm{VF}_V}(D)$ can be used for measuring the capacity of $S_D$.
Indeed, VFCV correctly estimates the risk of empirical risk minimizers over $S_D$ for every
$D$ and for both homoscedastic and heteroscedastic data; $\mathrm{crit}_{\mathrm{VF}_V}(D)$ only underestimates
$\| s - \hat{s}_{\widehat{m}(D)} \|_n^2$ for dimensions $D$ close to $n(V - 1)/(2V)$, for reasons explained at the end
of Remark 3. On the contrary, $\mathrm{crit}_{BM}(D)$ is a poor estimate of $\| s - \hat{s}_{\widehat{m}(D)} \|_n^2$ when data
are heteroscedastic.

Consequently, VFCV yields a much smaller performance index

\[ C_{or}([\mathrm{ERM}, \mathfrak{P}]) := \frac{\mathbb{E}\big[ \| s - \hat{s}_{\widehat{m}_{\mathrm{ERM}}(\widehat{D}_{\mathfrak{P}})} \|_n^2 \big]}
   {\mathbb{E}\big[ \inf_{m \in \mathcal{M}_n} \| s - \hat{s}_m(P_n) \|_n^2 \big]} \]

than BM when data are heteroscedastic (Table 2); see also the supplementary material
(Section 1) for details about the performances of BM and possible ways to improve them.
When data are homoscedastic, VFCV and BM have similar performances (maybe with a
slight advantage for BM), which is not surprising since BM uses the knowledge that data
are homoscedastic. Moreover, BM has been proved to be optimal in the homoscedastic
setting [15, 16].

  s     σ       Oracle         VF5            BM
  s2    c       2.88 ± 0.01    4.51 ± 0.03    5.27 ± 0.03
        pc,2    2.88 ± 0.02    6.58 ± 0.06    19.82 ± 0.07
        s       3.01 ± 0.01    5.21 ± 0.04    9.69 ± 0.40
  s3    c       3.18 ± 0.01    4.41 ± 0.02    4.39 ± 0.01
        pc,2    4.06 ± 0.02    5.99 ± 0.02    7.86 ± 0.03
        s       4.02 ± 0.01    5.97 ± 0.03    7.59 ± 0.03

Table 2: Performance $C_{or}([\mathrm{ERM}, \mathfrak{P}])$ for $\mathfrak{P} = \mathrm{Id}$ (that is, choosing the dimension $D^\star :=
\arg\min_{D \in \mathcal{D}_n} \{ \| s - \hat{s}_{\widehat{m}_{\mathrm{ERM}}(D)} \|_n^2 \}$; column 'Oracle'), $\mathfrak{P} = \mathrm{VF}_V$ with $V = 5$ or $\mathfrak{P} = \mathrm{BM}$. Several regression
functions $s$ and noise-level functions $\sigma$ have been considered, each time with $N = 10\,000$
independent samples. Next to each value is indicated the corresponding empirical standard
deviation divided by $\sqrt{N}$, measuring the uncertainty of the estimated performance.

Overall, VFCV appears to be a reliable alternative to BM when no prior knowledge
guarantees that data are homoscedastic.



5 New change-point detection procedures via cross-validation

Sections 3 and 4 showed that when data are heteroscedastic, CV can be used successfully
instead of penalized criteria for detecting breakpoints given their number, as well as for
estimating the number of breakpoints. Nevertheless, in Section 4 the segmentations
compared by CV were obtained by empirical risk minimization, so that they can be suboptimal
according to the results of Section 3.

The next step for obtaining reliable change-point detection procedures for
heteroscedastic data is to combine the two ideas, that is, to use CV twice. The goal of the present
section is to properly define such procedures (with various kinds of CV) and to assess their
performances.



5.1 Definition of a family of change-point detection procedures 

The general strategy used in this article for change-point detection relies on two steps:
first, detect where $(D - 1)$ breakpoints should be located for every $D \in \mathcal{D}_n$; second,
estimate the number $(D - 1)$ of breakpoints. This strategy can be summarized with the
following procedure:

Procedure 6 (General two-step change-point detection procedure).

1. $\forall D \in \mathcal{D}_n$, $\mathcal{A}_D(P_n) := \hat{s}_{\widehat{m}(D)}$ with $\widehat{m}(D) = \arg\min_{m \in \mathcal{M}_n(D)} \{ \mathrm{crit}_1(S_m, P_n) \}$, where for every
model $S$, $\mathrm{crit}_1(S, P_n) \in \mathbb{R}$ estimates $\| s - \mathrm{ERM}(S; P_n) \|_n^2$ and $\hat{s}_m = \mathrm{ERM}(S_m; P_n)$ is
defined as in Section 3.1.

2. $\widehat{D} = \arg\min_{D \in \mathcal{D}_n} \{ \mathrm{crit}_2(\mathcal{A}_D, P_n) \}$, where for every algorithm $\mathcal{A}_D$, $\mathrm{crit}_2(\mathcal{A}_D, P_n) \in \mathbb{R}$
estimates $\| s - \mathcal{A}_D(P_n) \|_n^2$.

3. Output: the segmentation $\widehat{m}(\widehat{D})$ and the corresponding estimator $\hat{s}_{\widehat{m}(\widehat{D})}$ of $s$.

Let us now detail the candidate criteria $\mathrm{crit}_1$ and $\mathrm{crit}_2$ that can be used in
Procedure 6. For the first step:

• The empirical risk ('ERM') is

\[ \mathrm{crit}_{1,\mathrm{ERM}}(S, P_n) := P_n \gamma(\mathrm{ERM}(S; P_n)) . \]

• The Leave-p-out estimator of the risk ('Lpo$_p$') is, for every $p \in \{1, \ldots, n - 1\}$,

\[ \mathrm{crit}_{1,\mathrm{Lpo}}(S, P_n, p) := \widehat{\mathcal{R}}_{\mathrm{Lpo}_p}(\mathrm{ERM}(S; \cdot), P_n) . \]

• For comparison, the ideal criterion ('Id') is defined by $\mathrm{crit}_{1,\mathrm{Id}}(S, P_n) :=
\| s - \mathrm{ERM}(S; P_n) \|_n^2$.

As in Section 3, Loo denotes Lpo$_1$. The VFCV estimator of the risk $\widehat{\mathcal{R}}_{\mathrm{VF}_V}$ could also be
used as $\mathrm{crit}_1$; it will not be considered in the following because it is computationally more
expensive and more variable than Lpo (see Section 3.2).
For the second step:

• Birgé and Massart's penalization criterion ('BM') is

\[ \mathrm{crit}_{2,\mathrm{BM}}(\mathcal{A}_D, P_n) := P_n \gamma(\mathcal{A}_D(P_n)) + \mathrm{pen}_{BM}(D) , \]

where $\mathrm{pen}_{BM}(D)$ is defined by (11) with $c_1 = 5$, $c_2 = 2$ and $\widehat{C}$ is chosen by the slope
heuristics (see Section 1 of the supplementary material).

• The V-fold cross-validation estimator of the risk ('VF$_V$') is, for every $V \in \{1, \ldots, n\}$,

\[ \mathrm{crit}_{2,\mathrm{VF}_V}(\mathcal{A}_D, P_n) := \widehat{\mathcal{R}}_{\mathrm{VF}_V}(\mathcal{A}_D, P_n) , \]

where $\widehat{\mathcal{R}}_{\mathrm{VF}_V}$ is defined by (9) and the blocks $B_1, \ldots, B_V$ are chosen as in Procedure 5
(see Remark 3).

• For comparison, the ideal criterion ('Id') is defined by $\mathrm{crit}_{2,\mathrm{Id}}(\mathcal{A}_D, P_n) :=
\| s - \mathcal{A}_D(P_n) \|_n^2$.

Remark 4. For $\mathrm{crit}_2$, definitions using Lpo could theoretically be considered. They are not
investigated here because they are computationally intractable.

In the following, the notation $[\alpha, \beta]$ is used as a shortcut for "Procedure 6 with $\mathrm{crit}_{1,\alpha}$
and $\mathrm{crit}_{2,\beta}$", and the outputs of $[\alpha, \beta]$ are denoted by $\widehat{m}_{[\alpha,\beta]} \in \mathcal{M}_n$ and $\tilde{s}_{[\alpha,\beta]} \in S^\star$. For
instance, BM coincides with [ERM, BM]; Procedures $[\alpha, \mathrm{Id}]$ are compared for several $\alpha$
in Section 3. Procedures $[\mathrm{ERM}, \beta]$ are compared for $\beta \in \{\mathrm{Id}, \mathrm{BM}, \mathrm{VF}_5\}$ in Section 4.
  s     σ       [ERM, VF5]      [Loo, VF5]      [Lpo_20, VF5]   [ERM, BM]
  s1    c       5.40 ± 0.05     5.03 ± 0.05     5.10 ± 0.05     3.91 ± 0.03
        pc,1    11.96 ± 0.03    10.25 ± 0.03    10.28 ± 0.03    12.85 ± 0.04
        pc,3    4.96 ± 0.05     4.82 ± 0.04     4.79 ± 0.05     13.08 ± 0.04
        s       7.33 ± 0.06     6.82 ± 0.05     6.99 ± 0.06     9.41 ± 0.04
  s2    c       4.51 ± 0.03     4.55 ± 0.03     4.50 ± 0.03     5.27 ± 0.03
        pc,1    11.67 ± 0.09    10.26 ± 0.08    10.29 ± 0.08    19.36 ± 0.07
        pc,3    6.66 ± 0.06     5.81 ± 0.06     5.74 ± 0.06     20.12 ± 0.06
        s       5.21 ± 0.04     5.19 ± 0.03     5.17 ± 0.03     9.69 ± 0.04
  s3    c       4.41 ± 0.02     4.54 ± 0.02     4.62 ± 0.02     4.39 ± 0.01
        pc,1    4.91 ± 0.02     4.40 ± 0.02     4.44 ± 0.02     6.50 ± 0.02
        pc,3    6.32 ± 0.02     5.74 ± 0.02     5.81 ± 0.02     8.47 ± 0.03
        s       5.97 ± 0.02     5.72 ± 0.02     5.86 ± 0.02     7.59 ± 0.03

Table 3: Performance $C_{or}(\mathfrak{P})$ for several change-point detection procedures $\mathfrak{P}$ in several
settings $(s, \sigma)$. Each time, $N = 10\,000$ independent samples have been generated. Next to
each value is indicated the corresponding empirical standard deviation divided by $\sqrt{N}$.



5.2 Simulation study

A simulation experiment compares procedures $[\alpha, \mathrm{VF}_5]$ for several $\alpha$ and [ERM, BM], in
the setting described in Section 3.3.1. A representative picture of the results is given by
Table 3 [see 21, Chapter 7, for additional results]. The (statistical) performance of each
competing procedure $\mathfrak{P}$ is measured by

\[ C_{or}(\mathfrak{P}) := \frac{\mathbb{E}\big[ \| s - \tilde{s}_{\mathfrak{P}}(P_n) \|_n^2 \big]}
   {\mathbb{E}\big[ \inf_{m \in \mathcal{M}_n} \| s - \hat{s}_m(P_n) \|_n^2 \big]} , \]

both expectations being evaluated by averaging over $N = 10\,000$ independent samples.

Remark 5. Birgé and Massart's penalization procedure is the only classical change-point
detection procedure considered in this experiment for two reasons. First, change-point
detection procedures looking for changes in the distribution of $Y_i$ would clearly fail to
detect changes in the mean of the signal, as soon as the noise-level $\sigma$ varies inside areas
where the mean is constant. Second, among procedures detecting changes in the mean of a
signal in a setting comparable to the setting of the paper (that is, frequentist, parametric,
off-line, with no information on the number of change-points), BM appears to be the most
reliable procedure according to recent papers [28, 30]. The question of the calibration of
$\widehat{C}$ is addressed in Section 1 of the supplementary material.

First, BM is consistently outperformed by the other procedures, except in the
homoscedastic settings in which it confirms its strength.

Second, empirical risk minimization (ERM) slightly outperforms CV (Loo and Lpo$_{20}$)
when data are homoscedastic. On the contrary, when data are heteroscedastic, Loo and
Lpo$_{20}$ clearly outperform ERM, often by a margin larger than 10% (for instance, when
$\sigma = \sigma_{pc,1}$). Therefore, the results of Section 3 are confirmed when using VF$_5$ (instead of
Id) for choosing the dimension.
  Framework        A              B              C
  [ERM, BM]        6.82 ± 0.03    7.21 ± 0.04    13.49 ± 0.07
  [ERM, VF5]       4.78 ± 0.03    5.09 ± 0.03    7.17 ± 0.05
  [Loo, VF5]       4.65 ± 0.03    4.88 ± 0.03    6.61 ± 0.05
  [Lpo_20, VF5]    4.78 ± 0.03    4.91 ± 0.03    6.49 ± 0.05
  [Lpo_50, VF5]    4.97 ± 0.03    5.18 ± 0.04    6.69 ± 0.05

Table 4: Performance $C_{or}^{(R)}(\mathfrak{P})$ of several model selection procedures $\mathfrak{P}$ in frameworks A,
B, C with sample size $n = 100$. In each framework, $N = 10\,000$ independent samples
have been considered. Next to each value is indicated the corresponding empirical standard
deviation divided by $\sqrt{N}$.



Third, the comparison between $[\mathrm{Lpo}_p, \mathrm{VF}_5]$ for several values of $p$ is less clear. Even
though $p = 1$ (that is, Loo) mostly outperforms $p = 20$ (as well as $p = 50$, see the
supplementary material), differences are small and often not significant despite the large
number of samples generated. The conclusion of the simulation experiment on this question
is that all values of $p$ between 1 and $n/2$ perform almost equally well, with a small
advantage to $p = 1$ which may not be general. Let us mention here that the choice of $p$ for
Lpo$_p$ is usually related to overpenalization [see for instance 21], but it seems difficult
to characterize the settings for which overpenalization is needed for detecting change-points
given their number.

5.3 Random frameworks 

In order to assess the generality of the results of Table 3, the procedures considered in
Section 5.2 have been compared in three random settings. The following process has been
repeated $N = 10\,000$ times. First, piecewise constant functions $s$ and $\sigma$ are randomly
chosen (see Section 2 of the supplementary material for details). Then, given $s$ and $\sigma$, a
data sample $(t_i, Y_i)_{1 \leq i \leq n}$ is generated as described in Section 3.3.1 and the same collection
of models is used. Finally, each procedure $\mathfrak{P}$ is applied to the sample $(t_i, Y_i)_{1 \leq i \leq n}$, and its
loss $\| s - \tilde{s}_{\mathfrak{P}}(P_n) \|_n^2$ is measured, as well as the loss of the oracle $\inf_{m \in \mathcal{M}_n} \| s - \hat{s}_m(P_n) \|_n^2$.
To summarize the results, the quality of each procedure is measured by the ratio

\[ C_{or}^{(R)}(\mathfrak{P}) := \frac{\mathbb{E}_{s, \sigma, \epsilon_1, \ldots, \epsilon_n}\big[ \| s - \tilde{s}_{\mathfrak{P}}(P_n) \|_n^2 \big]}
   {\mathbb{E}_{s, \sigma, \epsilon_1, \ldots, \epsilon_n}\big[ \inf_{m \in \mathcal{M}_n} \| s - \hat{s}_m(P_n) \|_n^2 \big]} . \]

The notation $C_{or}^{(R)}(\mathfrak{P})$ differs from $C_{or}(\mathfrak{P})$ to emphasize that each expectation includes the
randomness of $s$ and $\sigma$, in addition to the one of $(\epsilon_i)_{1 \leq i \leq n}$.

The results of this experiment, which are reported in Table 4, mostly confirm the
results of the previous section (except that all the frameworks are heteroscedastic here):
$[\mathrm{Lpo}_p, \mathrm{VF}_5]$ generally outperforms [ERM, VF$_5$] (clearly so in framework C, whereas $p = 50$
is slightly worse than ERM in frameworks A and B), which strongly outperforms
[ERM, BM]. Similar results, not reported here, have been obtained with a sample size
$n = 200$ and $N = 1\,000$ samples.
Moreover, the difference between the performances of $[\mathrm{Lpo}_p, \mathrm{VF}_5]$ and [ERM, VF$_5$] is
the largest in setting C and the smallest in setting A. This fact confirms the interpretation
given in Section 3 for the failure of ERM for localizing a given number of change-points.
Indeed, the main differences between frameworks A, B and C, which are precisely defined
in Section 2 of the supplementary material, can be sketched as follows:

A the partitions on which $s$ is built are often close to regular, and $\sigma$ is chosen
independently from $s$.

B the partitions on which $s$ is built are often irregular, and $\sigma$ is chosen independently
from $s$.

C the partitions on which $s$ is built are often irregular, and $\sigma$ depends on $s$, so that the
noise-level is smaller where $s$ jumps more often.

In other words, frameworks A, B and C have been built so that for any $D \in \mathcal{D}_n$, the
largest variations over $\mathcal{M}_n(D)$ of $V(m)$ (defined by (7)) occur in framework C, and the
smallest variations occur in framework A. As a consequence, variations of the performance
of [ERM, VF$_5$] compared to $[\mathrm{Lpo}_p, \mathrm{VF}_5]$ according to the framework certainly come from
the local overfitting phenomenon presented in Section 3.



6 Application to CGH microarray data 

In this section, the new change-point detection procedures proposed in the paper are 
applied to CGH microarray data. 



6.1 Biological context 

The purpose of Comparative Genomic Hybridization (CGH) microarray experiments is to 
detect and map chromosomal aberrations. For instance, a piece of chromosome can be 
amplified, that is appear several times more than usual, or deleted. Such aberrations are 
often related to cancer disease. 

Roughly, CGH profiles give the log-ratio of the DNA copy number along the
chromosomes, compared to a reference DNA sequence [see 35, 37, for details about the biological
context of CGH data].

The goal of CGH data analysis is to detect abrupt changes in the mean of a signal (the 
log-ratio of copy numbers), and to estimate the mean in each segment. Hence, change-point 
detection procedures are needed. 

Moreover, assuming that CGH data are homoscedastic is often unrealistic. Indeed,
changes in the chemical composition of the sequence are known to induce changes in the
variance of the observed CGH profile, possibly independently from variations of the true
copy number. Therefore, procedures robust to heteroscedasticity, such as the ones proposed
in Section 5, should yield better results (in terms of detecting changes of copy number)
than procedures assuming homoscedasticity.

The data set considered in this section is based on the Bt474 cell lines, which denote
epithelial cells obtained from human breast cancer tumors of a sixty-year-old woman [36].
A test genome of Bt474 cell lines is compared to a normal reference male genome. Even
though several chromosomes are studied in these cell lines, this section focuses on
chromosomes 1 and 9. Chromosome 1 exhibits a putative heterogeneous variance along the CGH
profile, and chromosome 9 is likely to meet the homoscedasticity assumption. Log-ratios of
copy numbers have been measured at 119 locations for chromosome 1 and at 93 locations
for chromosome 9.



6.2 Procedures used in the CGH literature 

Before applying Procedure 6 to the analysis of Bt474 CGH data, let us recall the definition
of two change-point detection procedures, which were the most successful for analyzing the
same data according to the literature [36].

The first procedure, denoted by 'BMsimple' below, is a simplified version of BM proposed
by Lavielle [28, Section 2] and first used on CGH data in [36]. Note that BM would give
similar results on the data of Figure 5.

The second procedure, denoted by 'PML' for penalized maximum likelihood, aims at
detecting changes in either the mean or the variance, that is breakpoints for $(s, \sigma)$. The
selected model is defined as the minimizer over $m \in \mathcal{M}_n$ of

\[ \mathrm{crit}_{\mathrm{PML}}(m) := \sum_{\lambda \in \Lambda_m} n_\lambda \log\Big( \frac{1}{n_\lambda} \sum_{t_i \in I_\lambda} \big( Y_i - \hat{s}_m(t_i; P_n) \big)^2 \Big) + C'' D_m , \]

where $n_\lambda = \mathrm{Card}\{ t_i \in I_\lambda \}$ and $C''$ is estimated from data by the slope heuristics algorithm.
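
The PML criterion can be evaluated for a given segmentation as in the sketch below. It assumes that $\hat{s}_m$ is the segment-wise mean (the least-squares fit on each segment) and takes the constant $C''$ as an argument `c2` rather than calibrating it by the slope heuristics; the function name and the 0-based breakpoint convention are hypothetical.

```python
import numpy as np

def crit_pml(y, breakpoints, c2):
    """Sum over segments of n_lambda * log(empirical variance on the segment),
    plus c2 times the number of segments.
    breakpoints: sorted start indices (0-based) of segments 2..D."""
    y = np.asarray(y, dtype=float)
    edges = [0, *breakpoints, len(y)]
    total = 0.0
    for a, b in zip(edges[:-1], edges[1:]):
        seg = y[a:b]
        # Residual variance around the segment mean (assumed form of s_hat_m).
        total += len(seg) * np.log(np.mean((seg - seg.mean()) ** 2))
    return total + c2 * (len(edges) - 1)
```

A segment on which the data are exactly constant gives a logarithm of zero, so the criterion is minus infinity there; a practical implementation would need a minimal segment length or a variance floor.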
6.3 Results

Results obtained with BMsimple, PML, [ERM, VF$_5$] and [Lpo$_{20}$, VF$_5$] on the Bt474 data
set are reported on Figure 5.

For chromosome 9, BMsimple and PML yield (almost) the same segmentation, so that
the homoscedasticity assumption is certainly not much violated. As expected, [ERM, VF$_5$]
and [Lpo$_{20}$, VF$_5$] also yield very similar segmentations, which confirms the reliability of
these procedures for homoscedastic signals [see 21, Section 7.6 for details].

The picture is quite different for chromosome 1. Indeed, as shown by Figure 5 (right),
BMsimple selects a segmentation with 7 breakpoints, whereas PML selects a segmentation
with only one breakpoint. The major difference between BMsimple and PML supports at
least the idea that these data are heteroscedastic.

Nevertheless, neither of the segmentations chosen by BMsimple and PML is entirely
satisfactory: BMsimple relies on an assumption which is certainly violated; PML may use
a change in the estimated variance for explaining several changes in the mean.

CV-based procedures [ERM, VF_5] and [Lpo_20, VF_5] yield two other segmentations,
with a medium number of breakpoints, respectively 4 and 3. In view of the simulation
experiments of the previous sections, the segmentation obtained via [Lpo_20, VF_5] should
be the most reliable one since data are heteroscedastic. Therefore, the right of Figure 5
can be interpreted as follows: the noise-level is small in the first part of chromosome 1,
then higher, but not as high as estimated by PML. In particular, the copy number changes
twice inside the second part of chromosome 1 (as defined by the segmentation obtained
with PML), indicating that two putative amplified regions of chromosome 1 have been
detected.

Figure 5: Change-point locations along Chromosome 9 (Left) and Chromosome 1 (Right).
The mean on each homogeneous region is indicated by plain horizontal lines.

Note however that choosing among the segmentations obtained with [ERM, VF_5] and
[Lpo_20, VF_5] is not an easy task without additional data. A definitive answer would need
further biological experiments.

7 Conclusion 

7.1 Results summary 

Cross-validation (CV) methods have been used to build reliable procedures (Procedure [6]) 
for detecting changes in the mean of a signal whose variance may not be constant. 

First, when the number of breakpoints is given, empirical risk minimization has been
proved to fail for some heteroscedastic problems, from both theoretical and experimental
points of view. On the contrary, Leave-p-out (Lpo_p) remains robust to heteroscedasticity
while being computationally efficient thanks to the closed-form formulas given in
Section 3.2.3 (Theorem 1).

Second, for choosing the number of breakpoints, the commonly used penalization
procedure proposed by Birgé and Massart in the homoscedastic framework should not be
applied to heteroscedastic data. V-fold cross-validation (VFCV) turns out to be a reliable
alternative, with both homoscedastic and heteroscedastic data, leading to much better
segmentations in terms of quadratic risk when data are heteroscedastic. Furthermore,
unlike usual deterministic penalized criteria, VFCV efficiently chooses among segmentations
obtained by either Lpo or empirical risk minimization, without any specific change in the
procedure.

To conclude, the combination of Lpo (for choosing a segmentation for each possible
number of breakpoints) and VFCV yields the most reliable procedure for detecting
changes in the mean of a signal which is not a priori known to be homoscedastic. The
resulting procedure is computationally tractable for small values of V, since its
computational complexity is of order $O(Vn^2)$, which is similar to many comparable
change-point detection procedures. The influence of V on the statistical performance of the
procedure is not studied specifically in this paper; nevertheless, considering V = 5 only was
sufficient to obtain a better statistical performance than Birgé and Massart's penalization
procedure when data are heteroscedastic. When applied to real data (CGH profiles in
Section 6), the proposed procedure turns out to be quite useful and effective, for a data set
on which existing procedures highly disagree because of heteroscedasticity.
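
To make the two-step structure concrete, here is a minimal Python sketch in the spirit of [ERM, VF_V]: empirical risk minimization by dynamic programming for each number of segments D, then V-fold cross-validation to choose D. It is an illustration under stated assumptions, not the authors' implementation: the function names, the interleaved fold assignment, and the rule assigning a validation point to the training segment whose left end precedes it are all hypothetical choices. The Lpo variant of the first step is not shown.

```python
import numpy as np

def erm_segmentations(y, d_max):
    """For D = 1..d_max, find the split of y into D contiguous segments that
    minimises the residual sum of squares (dynamic programming).
    Returns {D: list of segment start indices, beginning with 0}."""
    y = np.asarray(y, dtype=float)
    n = y.size
    c1 = np.concatenate(([0.0], np.cumsum(y)))
    c2 = np.concatenate(([0.0], np.cumsum(y ** 2)))

    def cost(i, j):
        # Residual sum of squares of y[i:j] around its mean (requires j > i).
        return c2[j] - c2[i] - (c1[j] - c1[i]) ** 2 / (j - i)

    best = np.full((d_max + 1, n + 1), np.inf)
    arg = np.zeros((d_max + 1, n + 1), dtype=int)
    best[0, 0] = 0.0
    for d in range(1, d_max + 1):
        for j in range(d, n + 1):
            cands = [best[d - 1, i] + cost(i, j) for i in range(d - 1, j)]
            k = int(np.argmin(cands))
            best[d, j] = cands[k]
            arg[d, j] = k + d - 1          # start index of the last segment
    out = {}
    for D in range(1, d_max + 1):
        starts, j = [], n
        for d in range(D, 0, -1):          # backtrack from the last segment
            j = int(arg[d, j])
            starts.append(j)
        out[D] = starts[::-1]
    return out

def vfold_choose_dimension(t, y, d_max, V=5):
    """Choose the number of segments by V-fold cross-validation, then refit
    ERM on the full sample. t must be sorted and distinct."""
    t, y = np.asarray(t, dtype=float), np.asarray(y, dtype=float)
    folds = np.arange(y.size) % V          # assumption: interleaved folds
    cv = np.zeros(d_max)
    for v in range(V):
        train, val = folds != v, folds == v
        t_tr, y_tr = t[train], y[train]
        for D, starts in erm_segmentations(y_tr, d_max).items():
            left = t_tr[starts]            # left end of each training segment
            seg_tr = np.searchsorted(left, t_tr, side="right") - 1
            means = (np.bincount(seg_tr, weights=y_tr, minlength=D)
                     / np.bincount(seg_tr, minlength=D))
            seg_val = np.clip(np.searchsorted(left, t[val], side="right") - 1, 0, D - 1)
            cv[D - 1] += np.mean((y[val] - means[seg_val]) ** 2) / V
    d_hat = int(np.argmin(cv)) + 1
    return d_hat, erm_segmentations(y, d_max)[d_hat]
```

The pure-Python dynamic programme above costs of order d_max · n² operations per training sample, so the whole sketch scales like V · d_max · n²; it is meant for readability rather than speed.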

7.2 Prospects 

The general form of Procedure [6] could be used with several other criteria, at both steps of 
the change-point detection procedure. For instance, resampling penalties [B] could be used 
at the first step, for localizing the change-points given their number. At the second step, 
V-fold penalization [3] could also be used instead of VFCV, with the same computational 
cost and possibly an improved statistical performance. 
Comparing precisely these resampling-based criteria for optimizing the performance of
Procedure [6] would be of great interest and deserves further work. Simultaneously, several
values of V should be compared for the second step of Procedure [6], and the precise influence
of p when Lpo_p is used at the first step should be further investigated. Preliminary results
in this direction can already be found in [21, Chapter 7].

Supplementary material for Segmentation of the mean of
heteroscedastic data via cross-validation

Sylvain Arlot and Alain Celisse

1 Calibration of Birgé and Massart's penalization

Birgé and Massart's penalization makes use of the penalty

$$\mathrm{pen}_{\mathrm{BM}}(D) := \frac{C D}{n} \left( 5 + 2 \log\left( \frac{n}{D} \right) \right) .$$
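
As a small illustration, the penalty $\mathrm{pen}_{\mathrm{BM}}(D) = (C D / n)(5 + 2\log(n/D))$ can be evaluated as follows. The function name `pen_bm` and its argument order are hypothetical; the constant C is passed in by the caller, since its calibration is the subject of this appendix.

```python
import math

def pen_bm(D: int, n: int, C: float) -> float:
    """Birgé-Massart penalty (C * D / n) * (5 + 2 * log(n / D))
    for a model of dimension D and a sample of size n."""
    if not 1 <= D <= n:
        raise ValueError("D must satisfy 1 <= D <= n")
    return C * D / n * (5.0 + 2.0 * math.log(n / D))
```

For D = n the logarithm vanishes and the penalty reduces to 5C.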



In a previous version of this work [f], Chapter 7], $C$ was defined as suggested in 0, 0],
that is, $C = 2\widehat{K}_{\mathrm{max.jump}}$ with the notation below. This yielded poor performances, which
seemed related to the definition of $C$. Therefore, alternative definitions for $C$ have been
investigated, leading to the choice $C = 2\widehat{K}_{\mathrm{thresh.}}$ throughout the paper, where
$\widehat{K}_{\mathrm{thresh.}}$ is defined by (2) below. The present appendix intends to motivate this choice.

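Two main approaches have been considered in the literature for defining $C$ in the
penalty $\mathrm{pen}_{\mathrm{BM}}$:

• Use $C = \widehat{\sigma}^2$, any estimate of the noise-level, for instance,

$$\widehat{\sigma}^2 := \frac{1}{n} \sum_{i=1}^{n/2} \left( Y_{2i} - Y_{2i-1} \right)^2 , \qquad (1)$$

assuming $n$ is even and $t_1 < \cdots < t_n$.

• Use Birgé and Massart's slope heuristics, that is, compute the sequence

$$\widehat{D}(K) := \arg\min_{D \in \mathcal{D}_n} \left\{ P_n \gamma\left( \widehat{s}_{\widehat{m}_{\mathrm{ERM}}(D)} \right) + \frac{K D}{n} \left( 5 + 2 \log\left( \frac{n}{D} \right) \right) \right\} ,$$

find the (unique) $K = \widehat{K}_{\mathrm{jump}}$ at which $\widehat{D}(K)$ jumps from large to small values, and
define $C = 2\widehat{K}_{\mathrm{jump}}$.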
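
The two approaches can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, `emp_risk[D-1]` is assumed to hold the empirical risk of the ERM segmentation of dimension D (computed elsewhere), and the jump is located on a user-supplied grid of K values, so the returned constant is only a grid approximation of the jump location.

```python
import numpy as np

def sigma2_hat(y):
    """Estimator (1): (1/n) * sum over pairs of (Y_{2i} - Y_{2i-1})^2,
    for n even and observations ordered by increasing t."""
    y = np.asarray(y, dtype=float)
    n = y.size
    if n % 2 != 0:
        raise ValueError("n must be even")
    diffs = y[1::2] - y[0::2]              # Y_2 - Y_1, Y_4 - Y_3, ...
    return float(np.sum(diffs ** 2) / n)

def selected_dimension(emp_risk, n, K):
    """D_hat(K): minimiser over D of emp_risk[D-1] + (K*D/n)*(5 + 2*log(n/D))."""
    emp_risk = np.asarray(emp_risk, dtype=float)
    D = np.arange(1, emp_risk.size + 1)
    crit = emp_risk + K * D / n * (5.0 + 2.0 * np.log(n / D))
    return int(D[np.argmin(crit)])

def k_max_jump(emp_risk, n, K_grid):
    """First grid value of K after the largest drop of D_hat(K) along K_grid
    (K_grid is assumed sorted in increasing order)."""
    dims = np.array([selected_dimension(emp_risk, n, K) for K in K_grid])
    drops = dims[:-1] - dims[1:]
    return float(K_grid[int(np.argmax(drops)) + 1])
```

With either approach, the resulting constant is then plugged into the penalty as C (respectively $\widehat{\sigma}^2$ or twice the detected jump location).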

The first approach follows from theoretical and experimental results 0, 0| which show
that $C$ should be close to $\sigma^2$ when the noise-level is constant; (1) is a classical estimator
of the variance used for instance by Baraud 0] for model selection in a different setting.

The optimality (in terms of oracle inequalities) of the second approach has been proved
for regression with homoscedastic Gaussian noise and possibly exponential collections of
models [H], as well as in a heteroscedastic framework with polynomial collections of models
[3]. In the context of change-point detection with homoscedastic data, Lavielle [3] and
Lebarbier [8] showed that $C = 2\widehat{K}_{\mathrm{max.jump}}$ can even perform better than $C = \widehat{\sigma}^2$ when
$\widehat{K}_{\mathrm{max.jump}}$ corresponds to the highest jump of $\widehat{D}(K)$.
Alternatively, it was proposed in [2] to define $C = 2\widehat{K}_{\mathrm{thresh.}}$ where

$$\widehat{K}_{\mathrm{thresh.}} := \min\left\{ K \ \text{s.t.} \ \widehat{D}(K) \leq D_{\mathrm{thresh.}} := \frac{n}{\ln(n)} \right\} . \qquad (2)$$

s     sigma   2K_max.jump     2K_thresh.      sigma_hat^2     sigma_true^2
1     c        6.85 ± 0.12     3.91 ± 0.03     1.74 ± 0.02     2.05 ± 0.02
1     pc,3    17.56 ± 0.15    13.08 ± 0.04     4.42 ± 0.04    10.43 ± 0.05
1     s       20.07 ± 0.31     9.41 ± 0.04     2.18 ± 0.03     1.66 ± 0.02
2     c        6.02 ± 0.03     5.27 ± 0.03     3.58 ± 0.02     3.54 ± 0.02
2     pc,3    17.76 ± 0.10    20.12 ± 0.07    10.58 ± 0.07    16.64 ± 0.08
2     s       10.17 ± 0.05     9.69 ± 0.04     5.28 ± 0.03    10.95 ± 0.02
3     c        4.97 ± 0.02     4.39 ± 0.01     4.62 ± 0.01     4.21 ± 0.01
3     pc,3     8.66 ± 0.03     8.47 ± 0.03     6.64 ± 0.02     8.00 ± 0.03
3     s        8.50 ± 0.04     7.59 ± 0.03     5.94 ± 0.02    15.50 ± 0.04
A              7.52 ± 0.04     6.82 ± 0.03     4.86 ± 0.03     5.55 ± 0.03
B              7.89 ± 0.04     7.21 ± 0.04     5.18 ± 0.03     5.77 ± 0.03
C             12.81 ± 0.08    13.49 ± 0.07     8.93 ± 0.06    12.44 ± 0.07

Table 1: Performance $C_{\mathrm{or}}(\mathrm{BM})$ with four different definitions of $C$ (see text), in some of
the simulation settings considered in the paper. In each setting, $N = 10\,000$ independent
samples have been generated. Next to each value is indicated the corresponding empirical
standard deviation divided by $\sqrt{N}$.

These three definitions of $C$ have been compared with
$C = \sigma^2_{\mathrm{true}} := n^{-1} \sum_{i=1}^{n} \sigma(t_i)^2$ in
the settings of the paper. A representative part of the results is reported in Table 1. The
main conclusions are the following.

• $2\widehat{K}_{\mathrm{thresh.}}$ almost always beats $2\widehat{K}_{\mathrm{max.jump}}$, even in homoscedastic settings. This
confirms some previously reported simulation results.

• $\sigma^2_{\mathrm{true}}$ often beats slope heuristics-based definitions of $C$, but not always, as previously
noticed by Lebarbier [8]. Differences of performance can be huge (in particular when
$\sigma = \sigma_s$), but not always in favour of $\sigma^2_{\mathrm{true}}$ (for instance, when $s = s_3$).

• $\widehat{\sigma}^2$ yields significantly better performance than $\sigma^2_{\mathrm{true}}$ in most settings (but not all),
with huge margins in some heteroscedastic settings.

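The latter result actually comes from an artefact, which can be explained as follows.
First,

$$\mathbb{E}\left[ \widehat{\sigma}^2 \right] = \frac{1}{n} \sum_{i=1}^{n} \sigma(t_i)^2 + \frac{1}{n} \sum_{i=1}^{n/2} \left( s(t_{2i}) - s(t_{2i-1}) \right)^2 \geq \frac{1}{n} \sum_{i=1}^{n} \sigma(t_i)^2 = \sigma^2_{\mathrm{true}} .$$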

The difference between these expectations is not negligible in all the settings of the paper.
For instance, when $n = 100$, $t_i = i/n$ and $s = s_1$, $n^{-1} \sum_i (s(t_{2i}) - s(t_{2i-1}))^2 = 0.04$ whereas
$\sigma^2_{\mathrm{true}}$ varies between 0.015 (when $\sigma = \sigma_{pc,1}$) and 0.093 (when $\sigma = \sigma_{pc,3}$). Nevertheless, $\widehat{\sigma}^2$
would not overestimate $\sigma^2_{\mathrm{true}}$ at all in a very close setting: shifting the jumps of $s_1$ by
1/100 is sufficient to make $n^{-1} \sum_i (s(t_{2i}) - s(t_{2i-1}))^2$ equal to zero, and the performances
of BM with $C = \widehat{\sigma}^2$ would then be very close to the performances of BM with $C = \sigma^2_{\mathrm{true}}$.
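
The sensitivity of this bias term to the exact position of the jumps can be checked numerically. The sketch below is only an illustration: the step signal is a hypothetical example with four unit jumps (it is not claimed to be $s_1$), and the function names are assumptions. With jumps falling inside the pairs $(t_{2i-1}, t_{2i})$ the bias term is positive; after shifting every jump by one design point it vanishes.

```python
import numpy as np

def pairwise_bias(s_values):
    """(1/n) * sum_{i=1}^{n/2} (s(t_{2i}) - s(t_{2i-1}))^2 for the values
    s(t_1), ..., s(t_n) of the regression function on the design (n even)."""
    s_values = np.asarray(s_values, dtype=float)
    diffs = s_values[1::2] - s_values[0::2]
    return float(np.sum(diffs ** 2) / s_values.size)

def step_signal(n, jump_indices, heights):
    """Piecewise constant signal on i = 1..n: the k-th jump adds heights[k]
    to all design points with index i >= jump_indices[k] (1-based)."""
    i = np.arange(1, n + 1)
    s = np.zeros(n)
    for idx, h in zip(jump_indices, heights):
        s += h * (i >= idx)
    return s

n = 100
heights = [1.0, 1.0, 1.0, 1.0]
inside_pairs = step_signal(n, [20, 40, 60, 80], heights)   # jump between t_19 and t_20
between_pairs = step_signal(n, [21, 41, 61, 81], heights)  # jump between t_20 and t_21
bias_inside = pairwise_bias(inside_pairs)
bias_between = pairwise_bias(between_pairs)
```

In this hypothetical example each of the four jumps contributes 1/100 to the bias term when it falls inside a pair, and nothing once it is shifted by 1/100.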
Second, overpenalization turns out to improve the results of BM in most of the
heteroscedastic settings considered in the paper. The reason for this phenomenon is illustrated
by the right panel of Figure 4. Indeed, $\mathrm{pen}_{\mathrm{BM}}$ is a poor penalty when data are
heteroscedastic, underpenalizing dimensions close to the oracle but overpenalizing the largest
dimensions (remember that $C = 2\widehat{K}_{\mathrm{thresh.}}$ on Figure 4). Then, in a setting like $(s_2, \sigma_{pc,3})$,
multiplying $\mathrm{pen}_{\mathrm{BM}}$ by a factor $C_{\mathrm{over}} > 1$ helps decrease the selected dimension; the same
cause has different consequences in other settings, such as $(s_1, \sigma_s)$ or $(s_3, \sigma_c)$. Nevertheless,
even choosing $C$ using both $P_n$ and $s$, $(\mathrm{crit}_{\mathrm{BM}}(D))_{D > 0}$ remains a poor estimate of
$(\| s - \widehat{s}_{\widehat{m}_{\mathrm{ERM}}(D)} \|_n^2)_{D > 0}$ in most heteroscedastic settings (even up to an additive constant).

To conclude, $\mathrm{pen}_{\mathrm{BM}}$ with $C = \widehat{\sigma}^2$ is not a reliable change-point detection procedure,
and the apparently good performances observed in Table 1 could be misleading. This leads
to the remaining choice $C = 2\widehat{K}_{\mathrm{thresh.}}$, which has been used throughout the paper, although
this calibration method may certainly be improved.

Results of Table 1 for $C = \sigma^2_{\mathrm{true}}$ indicate how far the performances of $\mathrm{pen}_{\mathrm{BM}}$ could
be improved without overpenalization. According to Tables 4 and 5, BM with $C = \sigma^2_{\mathrm{true}}$
only has significantly better performances than [ERM, VF_5] or [Loo, VF_5] in the three
homoscedastic settings and in setting $(s_1, \sigma_s)$.

Finally, overpenalization could be used to improve BM, but choosing the overpenalization
factor from data is a difficult problem, especially without knowing a priori whether
the signal is homoscedastic or heteroscedastic. This question deserves a specific extensive
simulation experiment. To be completely fair with CV methods, such an experiment should
also compare BM with overpenalization to V-fold penalization [1] with overpenalization,
for choosing the number of change-points.

2 Random frameworks generation 

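The purpose of this appendix is to detail how piecewise constant functions $s$ and $\sigma$ have
been generated in the frameworks A, B and C of Section 5.3. In each framework, $s$ and $\sigma$
are of the form

$$s(x) = \sum_{j=0}^{K_s - 1} \alpha_j \mathbf{1}_{[a_j; a_{j+1})}(x) + \alpha_{K_s} \mathbf{1}_{[a_{K_s}; a_{K_s+1}]}(x) \quad \text{with} \quad a_0 = 0 < a_1 < \cdots < a_{K_s+1} = 1$$

$$\sigma(x) = \sum_{j=0}^{K_\sigma - 1} \beta_j \mathbf{1}_{[b_j; b_{j+1})}(x) + \beta_{K_\sigma} \mathbf{1}_{[b_{K_\sigma}; b_{K_\sigma+1}]}(x) \quad \text{with} \quad b_0 = 0 < b_1 < \cdots < b_{K_\sigma+1} = 1$$

for some positive integers $K_s$, $K_\sigma$ and real numbers $\alpha_0, \ldots, \alpha_{K_s} \in \mathbb{R}$ and $\beta_0, \ldots, \beta_{K_\sigma} > 0$.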






Remark 1. The frameworks A, B and C depend on the sample size $n$, through the
distribution of $K_s$, $K_\sigma$, and of the size of the intervals $[a_j; a_{j+1})$ and $[b_j; b_{j+1})$. This ensures
that the signal-to-noise ratio remains rather small, so that the quadratic risk remains an
adequate performance measure for change-point detection.

When the signal-to-noise ratio is larger (that is, when all jumps of s are much larger 
than the noise-level, and the number of jumps of s is small compared to the sample size), 
the change-point detection problem is of different nature. In particular, the number of 
change-points would be better estimated with procedures targeting identification (such as 
BIC, or even larger penalties) than efficiency (such as VFCV). 



2.1 Framework A 

In framework A, $s$ and $\sigma$ are generated as follows:

• $K_s$, the number of jumps of $s$, has uniform distribution over $\{3, \ldots, \lfloor \sqrt{n} \rfloor\}$.

• For $0 \leq j \leq K_s$,

$$a_{j+1} - a_j = \Delta^s_{\min} + \frac{\left( 1 - (K_s + 1) \Delta^s_{\min} \right) U_j}{\sum_{k=0}^{K_s} U_k}$$

with $\Delta^s_{\min} = \min\{5/n, 1/(K_s + 1)\}$ and $U_0, \ldots, U_{K_s}$ are i.i.d. with uniform
distribution over $[0; 1]$.

• $\alpha_0 = V_0$ and for $1 \leq j \leq K_s$, $\alpha_j = \alpha_{j-1} + V_j$ where $V_0, \ldots, V_{K_s}$ are i.i.d. with
uniform distribution over $[-1; -0.1] \cup [0.1; 1]$.

• $K_\sigma$, the number of jumps of $\sigma$, has uniform distribution in $\{5, \ldots, \lfloor \sqrt{n} \rfloor\}$.

• For $0 \leq j \leq K_\sigma$,

$$b_{j+1} - b_j = \Delta^\sigma_{\min} + \frac{\left( 1 - (K_\sigma + 1) \Delta^\sigma_{\min} \right) U'_j}{\sum_{k=0}^{K_\sigma} U'_k}$$

with $\Delta^\sigma_{\min} = \min\{5/n, 1/(K_\sigma + 1)\}$ and $U'_0, \ldots, U'_{K_\sigma}$ are i.i.d. with uniform
distribution over $[0; 1]$.

• $\beta_0, \ldots, \beta_{K_\sigma}$ are i.i.d. with uniform distribution over $[0.05; 0.5]$.

Two examples of a function $s$ and a sample $(t_i, Y_i)$ generated in framework A are plotted
on Figure 1.
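
The generation steps of framework A can be sketched in Python as follows. This is a minimal illustration, not the authors' simulation code: the function names are hypothetical, and the sketch assumes that the increments of the breakpoints are the minimal length plus the remaining length shared proportionally to the uniform weights, so that they sum to 1. Only $s$ and $\sigma$ are generated; drawing the observations $Y_i$ would additionally require a noise distribution, which is not specified in this appendix.

```python
import numpy as np

def random_breakpoints(K, n, rng):
    """Breakpoints 0 = a_0 < a_1 < ... < a_{K+1} = 1 with increments
    delta_min + (1 - (K + 1) * delta_min) * U_j / sum(U), U_j i.i.d. U[0, 1]."""
    delta_min = min(5.0 / n, 1.0 / (K + 1))
    u = rng.uniform(0.0, 1.0, size=K + 1)
    increments = delta_min + (1.0 - (K + 1) * delta_min) * u / u.sum()
    a = np.concatenate(([0.0], np.cumsum(increments)))
    a[-1] = 1.0                                   # guard against round-off
    return a

def framework_a(n, rng):
    """Draw (a, alpha, b, beta) describing s and sigma (requires n >= 25)."""
    root = int(np.floor(np.sqrt(n)))
    K_s = int(rng.integers(3, root + 1))          # uniform over {3, ..., floor(sqrt(n))}
    a = random_breakpoints(K_s, n, rng)
    signs = rng.choice([-1.0, 1.0], size=K_s + 1)
    V = signs * rng.uniform(0.1, 1.0, size=K_s + 1)   # uniform over [-1,-0.1] U [0.1,1]
    alpha = np.cumsum(V)                          # alpha_0 = V_0, alpha_j = alpha_{j-1} + V_j
    K_sigma = int(rng.integers(5, root + 1))      # uniform over {5, ..., floor(sqrt(n))}
    b = random_breakpoints(K_sigma, n, rng)
    beta = rng.uniform(0.05, 0.5, size=K_sigma + 1)
    return a, alpha, b, beta

def piecewise_constant(x, knots, values):
    """Evaluate the step function equal to values[j] on [knots[j], knots[j+1])
    (last interval closed) at the points x in [0, 1]."""
    idx = np.searchsorted(knots, x, side="right") - 1
    idx = np.clip(idx, 0, len(values) - 1)
    return np.asarray(values)[idx]
```

For example, `a, alpha, b, beta = framework_a(100, np.random.default_rng(0))` followed by `piecewise_constant(np.arange(1, 101) / 100, a, alpha)` gives the values of one random regression function on the design $t_i = i/n$.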



2.2 Framework B 

The only difference with framework A is that $U_0, \ldots, U_{K_s}$ are i.i.d. with the same
distribution as $Z = |10 Z_1 + Z_2|$ where $Z_1$ has Bernoulli distribution with parameter 1/2 and $Z_2$
has a standard Gaussian distribution. Two examples of a function $s$ and a sample $(t_i, Y_i)$
generated in framework B are plotted on Figure 2.
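
The weights used in framework B can be drawn as in the following sketch (the function name `sample_z` is hypothetical). About half of the draws are close to 10 and the other half close to 0, so that a few intervals are much longer than the others once the weights are normalized.

```python
import numpy as np

def sample_z(size, rng):
    """Draw Z = |10 * Z1 + Z2| with Z1 ~ Bernoulli(1/2) and Z2 ~ N(0, 1),
    Z1 and Z2 being drawn independently."""
    z1 = rng.integers(0, 2, size=size)     # 0 or 1 with probability 1/2 each
    z2 = rng.standard_normal(size)
    return np.abs(10.0 * z1 + z2)
```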

Figure 1: Random framework A: two examples of a sample $(t_i, Y_i)_{1 \leq i \leq 100}$ and the
corresponding regression function $s$.

Figure 2: Random framework B: two examples of a sample $(t_i, Y_i)_{1 \leq i \leq 100}$ and the
corresponding regression function $s$.

2.3 Framework C



The main difference between frameworks C and B is that $[0; 1]$ is split into two regions:
$a_{K_{s,1}+1} = 1/2$ and $K_s = K_{s,1} + K_{s,2} + 1$ for some positive integers $K_{s,1}$, $K_{s,2}$, and the
bounds of the distribution of $\beta_j$ are larger when $b_j \geq 1/2$ and smaller when $b_j < 1/2$. Two
examples of a function $s$ and a sample $(t_i, Y_i)$ generated in framework C are plotted on
Figure 3. More precisely, $s$ and $\sigma$ are generated as follows:

• $K_{s,1}$ has uniform distribution over $\{2, \ldots, K_{\max,1}\}$ with
$K_{\max,1} = \lfloor \sqrt{n} \rfloor - 1 - \lfloor (\lfloor \sqrt{n} - 1 \rfloor)/3 \rfloor$.

• $K_{s,2}$ has uniform distribution over $\{0, \ldots, K_{\max,2}\}$ with $K_{\max,2} = \lfloor (\lfloor \sqrt{n} - 1 \rfloor)/3 \rfloor$.

• Let $U_0, \ldots, U_{K_s}$ be i.i.d. random variables with the same distribution as
$Z = |10 Z_1 + Z_2|$ where $Z_1$ has Bernoulli distribution with parameter 1/2 and $Z_2$ has
a standard Gaussian distribution.



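For $0 \leq j \leq K_{s,1}$,

$$a_{j+1} - a_j = \Delta^{s,1}_{\min} + \frac{\left( 1/2 - (K_{s,1} + 1) \Delta^{s,1}_{\min} \right) U_j}{\sum_{k=0}^{K_{s,1}} U_k}$$

with $\Delta^{s,1}_{\min} = \min\{5/n, 1/(K_{s,1} + 1)\}$.
For $K_{s,1} + 1 \leq j \leq K_s$,

$$a_{j+1} - a_j = \Delta^{s,2}_{\min} + \frac{\left( 1/2 - (K_{s,2} + 1) \Delta^{s,2}_{\min} \right) U_j}{\sum_{k=K_{s,1}+1}^{K_s} U_k}$$

with $\Delta^{s,2}_{\min} = \min\{5/n, 1/(K_{s,2} + 1)\}$.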



• $\alpha_0 = V_0$ and for $1 \leq j \leq K_s$, $\alpha_j = \alpha_{j-1} + V_j$ where $V_0, \ldots, V_{K_s}$ are i.i.d. with
uniform distribution over $[-1; -0.1] \cup [0.1; 1]$.

• $K_\sigma$, $(b_{j+1} - b_j)_{0 \leq j \leq K_\sigma}$ are distributed as in frameworks A and B.

• $\beta_0, \ldots, \beta_{K_\sigma}$ are independent.

When $b_j < 1/2$, $\beta_j$ has uniform distribution over $[0.025; 0.2]$.
When $b_j \geq 1/2$, $\beta_j$ has uniform distribution over $[0.1; 0.8]$.
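
Two ingredients specific to framework C are sketched below: the upper bounds $K_{\max,1}$ and $K_{\max,2}$, and the noise levels $\beta_j$ whose range depends on the side of 1/2 on which $b_j$ lies. The function names are hypothetical, and the sketch assumes that the breakpoints $b_0, \ldots, b_{K_\sigma+1}$ have already been generated as in frameworks A and B.

```python
import math
import numpy as np

def k_max_pair(n):
    """Return (K_max1, K_max2) with K_max2 = floor((floor(sqrt(n)) - 1) / 3)
    and K_max1 = floor(sqrt(n)) - 1 - K_max2."""
    root = math.isqrt(n)                   # exact floor of sqrt(n)
    k_max2 = (root - 1) // 3
    return root - 1 - k_max2, k_max2

def framework_c_beta(b, rng):
    """Draw beta_j ~ U[0.025, 0.2] if b_j < 1/2 and U[0.1, 0.8] otherwise,
    where b = (b_0, ..., b_{K+1}) and b_j is the left end of the j-th interval."""
    left = np.asarray(b, dtype=float)[:-1]
    low = np.where(left < 0.5, 0.025, 0.1)
    high = np.where(left < 0.5, 0.2, 0.8)
    return rng.uniform(low, high)          # one independent draw per interval
```

For n = 100 this gives at most 6 jumps in the first half and 3 in the second half, so that $K_{s,1} + K_{s,2} + 1$ never exceeds $\lfloor \sqrt{n} \rfloor = 10$.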



3 Additional results from the simulation study 

In the next pages are presented extended versions of the Tables of the main paper, as well
as an extended version of Table 1 (Table 7).

Figure 3: Random framework C: two examples of a sample $(t_i, Y_i)_{1 \leq i \leq 100}$ and the
corresponding regression function $s$.

s     sigma   ERM            Loo            Lpo_20         Lpo_50
1     c       1.59 ± 0.01    1.60 ± 0.02    1.58 ± 0.01    1.58 ± 0.01
1     pc,1    1.04 ± 0.01    1.06 ± 0.01    1.06 ± 0.01    1.06 ± 0.01
1     pc,2    1.89 ± 0.02    1.87 ± 0.02    1.87 ± 0.02    1.87 ± 0.02
1     pc,3    2.05 ± 0.02    2.05 ± 0.02    2.05 ± 0.02    2.07 ± 0.02
1     s       1.54 ± 0.02    1.52 ± 0.02    1.52 ± 0.02    1.51 ± 0.02
2     c       2.88 ± 0.01    2.93 ± 0.01    2.93 ± 0.01    2.94 ± 0.01
2     pc,1    1.31 ± 0.02    1.16 ± 0.02    1.14 ± 0.02    1.11 ± 0.01
2     pc,2    2.88 ± 0.02    2.24 ± 0.02    2.19 ± 0.02    2.13 ± 0.02
2     pc,3    3.09 ± 0.03    2.52 ± 0.03    2.48 ± 0.03    2.32 ± 0.03
2     s       3.01 ± 0.01    3.03 ± 0.01    3.05 ± 0.01    3.13 ± 0.01
3     c       3.18 ± 0.01    3.25 ± 0.01    3.29 ± 0.01    3.44 ± 0.01
3     pc,1    3.00 ± 0.01    2.67 ± 0.02    2.68 ± 0.02    2.77 ± 0.02
3     pc,2    4.06 ± 0.02    3.63 ± 0.02    3.64 ± 0.02    3.78 ± 0.02
3     pc,3    4.41 ± 0.02    3.97 ± 0.02    4.00 ± 0.02    4.11 ± 0.02
3     s       4.02 ± 0.01    3.82 ± 0.01    3.85 ± 0.01    3.98 ± 0.01

Table 2: Average performance $C_{\mathrm{or}}([\mathfrak{P}, \mathrm{Id}])$ for change-point detection procedures $\mathfrak{P}$ among
ERM, Loo and Lpo_p with p = 20 and p = 50. Several regression functions $s$ and noise-level
functions $\sigma$ have been considered, each time with $N = 10\,000$ independent samples. Next
to each value is indicated the corresponding empirical standard deviation divided by $\sqrt{N}$,
measuring the uncertainty of the estimated performance.
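
Each table entry of the form "value ± uncertainty" can be computed from the N simulated performance values as in the sketch below. The function name is hypothetical, and the use of the unbiased (N - 1) normalization for the empirical standard deviation is an assumption; with N = 10 000 the choice of normalization is immaterial at the reported precision.

```python
import numpy as np

def mean_and_standard_error(values):
    """Return the empirical mean of the values and their empirical standard
    deviation divided by sqrt(N), where N is the number of values."""
    values = np.asarray(values, dtype=float)
    n_samples = values.size
    std = values.std(ddof=1)               # assumption: unbiased normalization
    return float(values.mean()), float(std / np.sqrt(n_samples))
```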



s     sigma   Oracle         VF_5            BM
1     c       1.59 ± 0.01     5.40 ± 0.05     3.91 ± 0.03
1     pc,1    1.04 ± 0.01    11.96 ± 0.03    12.85 ± 0.04
1     pc,2    1.89 ± 0.02     6.43 ± 0.05    13.03 ± 0.04
1     pc,3    2.05 ± 0.02     4.96 ± 0.05    13.08 ± 0.04
1     s       1.54 ± 0.02     7.33 ± 0.06     9.41 ± 0.04
2     c       2.88 ± 0.01     4.51 ± 0.03     5.27 ± 0.03
2     pc,1    1.31 ± 0.02    11.67 ± 0.09    19.36 ± 0.07
2     pc,2    2.88 ± 0.02     6.58 ± 0.06    19.82 ± 0.07
2     pc,3    3.09 ± 0.03     6.66 ± 0.06    20.12 ± 0.07
2     s       3.01 ± 0.01     5.21 ± 0.04     9.69 ± 0.40
3     c       3.18 ± 0.01     4.41 ± 0.02     4.39 ± 0.01
3     pc,1    3.00 ± 0.01     4.91 ± 0.02     6.50 ± 0.02
3     pc,2    4.06 ± 0.02     5.99 ± 0.02     7.86 ± 0.03
3     pc,3    4.41 ± 0.02     6.32 ± 0.02     8.47 ± 0.03
3     s       4.02 ± 0.01     5.97 ± 0.03     7.59 ± 0.03

Table 3: Performance $C_{\mathrm{or}}([\mathrm{ERM}, \mathfrak{P}])$ for $\mathfrak{P} = \mathrm{Id}$ (that is, choosing the dimension
$D^\star := \arg\min_{D \in \mathcal{D}_n} \| s - \widehat{s}_{\widehat{m}_{\mathrm{ERM}}(D)} \|_n^2$), $\mathfrak{P} = \mathrm{VF}_V$ with $V = 5$, or $\mathfrak{P} = \mathrm{BM}$. Several regression
functions $s$ and noise-level functions $\sigma$ have been considered, each time with $N = 10\,000$
independent samples. Next to each value is indicated the corresponding empirical standard
deviation divided by $\sqrt{N}$, measuring the uncertainty of the estimated performance.

s     sigma   [ERM, VF_5]     [Loo, VF_5]     [Lpo_20, VF_5]  [Lpo_50, VF_5]  [ERM, BM]
1     c        5.40 ± 0.05     5.03 ± 0.05     5.10 ± 0.05     5.24 ± 0.05     3.91 ± 0.03
1     pc,1    11.96 ± 0.03    10.25 ± 0.03    10.28 ± 0.03    10.66 ± 0.04    12.85 ± 0.04
1     pc,2     6.43 ± 0.05     5.83 ± 0.05     5.99 ± 0.05     6.20 ± 0.05    13.03 ± 0.04
1     pc,3     4.96 ± 0.05     4.82 ± 0.04     4.79 ± 0.05     5.02 ± 0.05    13.08 ± 0.04
1     s        7.33 ± 0.06     6.82 ± 0.05     6.99 ± 0.06     6.91 ± 0.06     9.41 ± 0.04
2     c        4.51 ± 0.03     4.55 ± 0.03     4.50 ± 0.03     4.73 ± 0.03     5.27 ± 0.03
2     pc,1    11.67 ± 0.09    10.26 ± 0.08    10.29 ± 0.08    10.45 ± 0.09    19.36 ± 0.07
2     pc,2     6.58 ± 0.06     5.85 ± 0.06     5.85 ± 0.06     5.49 ± 0.06    19.82 ± 0.07
2     pc,3     6.66 ± 0.06     5.81 ± 0.06     5.74 ± 0.06     5.66 ± 0.06    20.12 ± 0.06
2     s        5.21 ± 0.04     5.19 ± 0.03     5.17 ± 0.03     5.51 ± 0.04     9.69 ± 0.04
3     c        4.41 ± 0.02     4.54 ± 0.02     4.62 ± 0.02     4.94 ± 0.02     4.39 ± 0.01
3     pc,1     4.91 ± 0.02     4.40 ± 0.02     4.44 ± 0.02     4.69 ± 0.02     6.50 ± 0.02
3     pc,2     5.99 ± 0.02     5.34 ± 0.02     5.42 ± 0.02     5.75 ± 0.02     7.86 ± 0.03
3     pc,3     6.32 ± 0.02     5.74 ± 0.02     5.81 ± 0.02     6.24 ± 0.02     8.47 ± 0.03
3     s        5.97 ± 0.02     5.72 ± 0.02     5.86 ± 0.02     6.07 ± 0.02     7.59 ± 0.03

Table 4: Performance $C_{\mathrm{or}}(\mathfrak{P})$ for several change-point detection procedures $\mathfrak{P}$. Several
regression functions $s$ and noise-level functions $\sigma$ have been considered, each time with
$N = 10\,000$ independent samples. Next to each value is indicated the corresponding
empirical standard deviation.



Framework         A               B               C
[ERM, BM]         6.82 ± 0.03     7.21 ± 0.04    13.49 ± 0.07
[ERM, VF_5]       4.78 ± 0.03     5.09 ± 0.03     7.17 ± 0.05
[Loo, VF_5]       4.65 ± 0.03     4.88 ± 0.03     6.61 ± 0.05
[Lpo_20, VF_5]    4.78 ± 0.03     4.91 ± 0.03     6.49 ± 0.05
[Lpo_50, VF_5]    4.97 ± 0.03     5.18 ± 0.04     6.69 ± 0.05

Table 5: Performance $C_{\mathrm{or}}(\mathfrak{P})$ of several model selection procedures $\mathfrak{P}$ in frameworks A,
B, C with sample size $n = 100$. In each framework, $N = 10\,000$ independent samples
have been considered. Next to each value is indicated the corresponding empirical standard
deviation divided by $\sqrt{N}$.

Framework         A               B               C
[ERM, BM]         9.04 ± 0.12    11.62 ± 0.14    21.21 ± 0.31
[ERM, BM ?]       5.34 ± 0.10     6.24 ± 0.11    11.48 ± 0.22
[ERM, VF_5]       5.10 ± 0.11     5.92 ± 0.11     7.31 ± 0.14
[Loo, VF_5]       4.90 ± 0.11     5.63 ± 0.11     6.89 ± 0.16
[Lpo_20, VF_5]    4.88 ± 0.10     5.55 ± 0.10     6.82 ± 0.15
[Lpo_50, VF_5]    5.11 ± 0.11     5.49 ± 0.10     7.14 ± 0.15

Table 6: Performance $C_{\mathrm{or}}(\mathfrak{P})$ of several model selection procedures $\mathfrak{P}$ in frameworks A,
B, C with sample size $n = 200$. In each framework, $N = 1\,000$ independent samples have
been considered. Next to each value is indicated the corresponding empirical standard
deviation divided by $\sqrt{N}$.



s     sigma   2K_max.jump     2K_thresh.      sigma_hat^2     sigma_true^2
1     c        6.85 ± 0.12     3.91 ± 0.03     1.74 ± 0.02     2.05 ± 0.02
1     pc,1    70.97 ± 1.18    12.85 ± 0.04     1.13 ± 0.02    10.20 ± 0.05
1     pc,2    23.74 ± 0.26    13.03 ± 0.04     3.55 ± 0.04    10.43 ± 0.05
1     pc,3    17.56 ± 0.15    13.08 ± 0.04     4.42 ± 0.04    10.43 ± 0.05
1     s       20.07 ± 0.31     9.41 ± 0.04     2.18 ± 0.03     1.66 ± 0.02
2     c        6.02 ± 0.03     5.27 ± 0.03     3.58 ± 0.02     3.54 ± 0.02
2     pc,1    17.83 ± 0.10    19.36 ± 0.07     8.52 ± 0.06    15.62 ± 0.08
2     pc,2    17.63 ± 0.10    19.82 ± 0.07    10.77 ± 0.07    16.56 ± 0.08
2     pc,3    17.76 ± 0.10    20.12 ± 0.07    10.58 ± 0.07    16.64 ± 0.08
2     s       10.17 ± 0.05     9.69 ± 0.04     5.28 ± 0.03    10.95 ± 0.02
3     c        4.97 ± 0.02     4.39 ± 0.01     4.62 ± 0.01     4.21 ± 0.01
3     pc,1     7.18 ± 0.03     6.50 ± 0.02     4.52 ± 0.02     6.70 ± 0.03
3     pc,2     8.14 ± 0.03     7.86 ± 0.03     6.22 ± 0.02     7.55 ± 0.03
3     pc,3     8.66 ± 0.03     8.47 ± 0.03     6.64 ± 0.02     8.00 ± 0.03
3     s        8.50 ± 0.04     7.59 ± 0.03     5.94 ± 0.02    15.50 ± 0.04
A              7.52 ± 0.04     6.82 ± 0.03     4.86 ± 0.03     5.55 ± 0.03
B              7.89 ± 0.04     7.21 ± 0.04     5.18 ± 0.03     5.77 ± 0.03
C             12.81 ± 0.08    13.49 ± 0.07     8.93 ± 0.06    12.44 ± 0.07

Table 7: Performance $C_{\mathrm{or}}(\mathrm{BM})$ with four different definitions of $C$ (see text), in some of
the simulation settings considered in the paper. In each setting, $N = 10\,000$ independent
samples have been generated. Next to each value is indicated the corresponding empirical
standard deviation divided by $\sqrt{N}$.